

THE ANNALS


of 
STATISTICS 


AN OFFICIAL JOURNAL OF 
THE INSTITUTE OF MATHEMATICAL STATISTICS 


Invited Paper
Influence functionals for time series ........ R. DOUGLAS MARTIN AND VICTOR J. YOHAI 781
Discussion ........ DAVID R. BRILLINGER 819
Discussion ........ J. FRANKE AND E. J. HANNAN 822
Discussion ........ HANS R. KÜNSCH 824
Discussion ........ ROBERT B. MILLER AND JAE JUNE LEE 827
Discussion ........ H. VINCENT POOR 829
Discussion ........ P. M. ROBINSON 832
Discussion ........ RUEY S. TSAY 835
Discussion ........ EDWARD J. WEGMAN 836
Discussion ........ MIKE WEST 838
Rejoinder ........ R. DOUGLAS MARTIN AND VICTOR J. YOHAI 840
Likelihood and observed geometries ........ O. E. BARNDORFF-NIELSEN 856
Rectangular lattice designs: Efficiency factors and analysis ........ R. A. BAILEY AND T. P. SPEED 874
Conservative confidence bands in curvilinear regression ........ DANIEL Q. NAIMAN 896
Spherical regression ........ TED CHANG 907
Asymptotic conditional inference for the offspring mean of a supercritical Galton-Watson process ........ T. J. SWEETING 925
Nonparametric Bayesian regression ........ DANIEL BARRY 934
Bayes rules for a clinical-trials model with dichotomous responses ........ GORDON SIMONS 954
An extreme value theory for sequence matching ........ RICHARD ARRATIA, LOUIS GORDON AND MICHAEL WATERMAN 971
Skewness and asymmetry: Measures and orderings ........ H. L. MACGILLIVRAY 994
On the asymptotic formula for the probability of a Type I error of mixture type power one tests ........ MOSHE POLLAK 1012
The shape of Bayes tests of power one ........ HANS RUDOLF LERCHE 1030
Very weak expansions for sequential confidence levels ........ MICHAEL WOODROOFE 1049
Asymptotic properties of Neyman-Pearson tests for infinite Kullback-Leibler information ........ ARNOLD JANSSEN 1068
Stochastic complexity and modeling ........ JORMA RISSANEN 1080
Asymptotic optimality of C_L and generalized cross-validation in ridge regression with application to spline smoothing ........ KER-CHAU LI 1101
Asymptotically minimax estimators for distributions with increasing failure rate ........ JANE-LING WANG 1113
Estimation of a unimodal distribution function ........ SHAW-HWA LO 1132
On asymptotically efficient estimation in semiparametric models ........ ANTON SCHICK 1139
Asymptotic behavior of the empiric distribution of M-estimated residuals from a regression model with many parameters ........ STEPHEN PORTNOY 1152
The use of subseries values for estimating the variance of a general statistic from a stationary sequence ........ EDWARD CARLSTEIN 1171
A minimum distance estimator for first-order autoregressive processes ........ CHAMONT W. H. WANG 1180
Minimum distance estimation and goodness-of-fit tests in first-order autoregression ........ HIRA L. KOUL 1194
Second-order risk structure of GLSE and MLE in a regression with a linear process ........ YASUYUKI TOYOOKA 1214
Bayesian statistical inference for sampling a finite population ........ ALBERT Y. LO 1226

Short Communications
The total time on test plot and the cumulative total time on test statistic for a counting process ........ RICHARD D. GILL 1234
Local convergence of empirical measures in the random censorship situation with application to density and rate estimators ........ HELMUT SCHÄFER 1240
Bahadur representations for robust scale estimators based on regression residuals ........ A. H. WELSH 1246
On a converse to Scheffé’s theorem ........ T. J. SWEETING 1252
The efficiency of Good’s nonparametric coverage estimator ........ WARREN W. ESTY 1257


Vol. 14, No. 3—September 1986 





INSTITUTE OF MATHEMATICAL STATISTICS 
(Organized September 12, 1935) 


The purpose of the Institute of Mathematical Statistics is to encourage the
development, dissemination, and application of mathematical statistics.








OFFICERS AND EDITORS 
President: 


Ronald Pyke, Department of Mathematics GN-50, University of Washington, Seattle, 
Washington 98195 

President-Elect: 
Bradley Efron, Department of Statistics, Sequoia Hall, Stanford University, Stanford, 
California 94305 

Past President: 
Paul Meier, Department of Statistics, University of Chicago, 6734 University Avenue, 
Chicago, Illinois 60637

Executive Secretary: 
Francisco J. Samaniego, Division of Statistics, 469 Kerr Hall, University of California,
Davis, California 95616 


Treasurer:
Nicholas P. Jewell, Group in Biostatistics, University of California, Berkeley. Please send
correspondence to: Business Office, 3401 Investment Boulevard #7, Hayward,
California 94545

Secretary: 
Lynne Billard, Department of Statistics and Computer Science, University of Georgia, 
Athens, Georgia 30602 

Editor: The Annals of Statistics 
Willem R. van Zwet, Department of Mathematics, University of Leiden, P.O. Box 9512, 
2300 RA Leiden, The Netherlands 

Editor: The Annals of Probability 
Thomas M. Liggett, Department of Mathematics, University of California, Los Angeles, 
California 90024 

Executive Editor: Statistical Science 
Morris H. DeGroot, Department of Statistics, Carnegie-Mellon University, Pittsburgh, 
Pennsylvania 15213 

Editor: The IMS Bulletin 
George P. H. Styan, Department of Mathematics and Statistics, Burnside Hall, McGill
University, 805 Sherbrooke Street West, Montreal PQ, Canada H3A 2K6

Editor: The IMS Lecture Notes—Monograph Series 
Shanti S. Gupta, Department of Statistics, Purdue University, West Lafayette, Indiana


Managing Editor:
Paul Shaman, Department of Statistics, University of Pennsylvania, Philadelphia,
Pennsylvania 19104


Journals. The scientific journals of the Institute are The Annals of Statistics, The Annals of
Probability, and Statistical Science. The news organ of the Institute is The Institute of
Mathematical Statistics Bulletin.


Individual, Institutional, and Corporate Memberships. All individual members receive
Statistical Science and The IMS Bulletin for a basic annual dues rate of $30. Individual members
may elect to receive one Annals for an additional $10 or both Annals for an additional $20. Dues
allocations to each journal are set by Council resolution. Of the total dues paid, $8 is allocated to
The IMS Bulletin and the remaining amount is allocated equally among the scientific journals
received. Memberships are available at a reduced rate (40% of regular rates) for full-time students, 
permanent residents of developing countries, and retired members. Retired members may also 
elect to receive the Bulletin only for $10. Institutional memberships are available to nonprofit 
organizations at $225 per year and corporate memberships are available to other organizations at 
$600 per year. Institutional and corporate memberships include two multiple-readership copies of 
all IMS journals in addition to other benefits specified for each category (details available from 
the IMS Business Office). 


Individual and General Subscriptions. Subscriptions are available on a calendar-year basis.
For 1986, all subscriptions to one or both Annals automatically include one subscription to
Statistical Science. Individual subscriptions are for the personal use of the subscriber and must be
in the name of, paid directly by, and mailed to an individual. Individual subscriptions for 1986 are
available to both Annals and Statistical Science ($79), one Annals and Statistical Science ($52),
Statistical Science only [price illegible], and The IMS Bulletin ($15). General subscriptions are for libraries,
institutions, and any multiple-readership use. General subscriptions for 1986 are available to both
Annals and Statistical Science ($155), The Annals of Statistics and Statistical Science ($85), The
Annals of Probability and Statistical Science ($80), Statistical Science only ($40), and The IMS
Bulletin ($20). Air mail rates for overseas delivery of general subscriptions are $40 per title.


Correspondence. Mail to IMS should be sent to the IMS Business Office (membership,
subscriptions, claims, copyright permissions, advertising, back issues), the Editor of the appropriate
journal (submissions, editorial content), or the Managing Editor (production).

The Annals of Statistics (ISSN 0090-5364), Volume 14, Number 3, September 1986. Published
in March, June, September, and December by the Institute of Mathematical Statistics, 3401
Investment Boulevard #7, Hayward, California 94545. Second-class postage paid at Hayward,
California and at additional mailing offices. Postmaster: Send address changes to The Annals
of Statistics, Institute of Mathematical Statistics, 3401 Investment Boulevard #7, Hayward,
California 94545.


Copyright © 1986 by the Institute of Mathematical Statistics 
Printed in the United States of America




EDITORIAL STAFF 
EDITOR 
WILLEM R. VAN ZWET
ASSOCIATE EDITORS 

JAMES O. BERGER        RICHARD D. GILL        ROSS J. MUIRHEAD
PETER J. BICKEL        FRIEDRICH GÖTZE        DAVID POLLARD
LAWRENCE D. BROWN      PIET GROENEBOOM        DAVID O. SIEGMUND
RAYMOND J. CARROLL     PETER HALL             A. F. M. SMITH
CHING-SHUI CHENG       R. Z. HASMINSKII       TERRY SPEED
DENNIS D. COX          NIELS KEIDING          JON A. WELLNER
MORRIS L. EATON        STEFFEN L. LAURITZEN   MICHAEL WOODROOFE
DAVID F. FINDLEY       BRUCE G. LINDSAY

EDITORIAL ASSISTANT 


LUCIE W. VAN ZWET 


MANAGING EDITOR 
PAUL SHAMAN 


EDITORIAL ASSISTANT 
TAMMY MORROW







PAST EDITORS

THE ANNALS OF MATHEMATICAL STATISTICS
H. C. CARVER, 1930-1938             WILLIAM KRUSKAL, 1958-1961
S. S. WILKS, 1938-1949              J. L. HODGES, JR., 1961-1964
T. W. ANDERSON, 1950-1952           D. L. BURKHOLDER, 1964-1967
E. L. LEHMANN, 1953-1955            Z. W. BIRNBAUM, 1967-1970
T. E. HARRIS, 1955-1958             INGRAM OLKIN, 1970-1972

THE ANNALS OF STATISTICS            THE ANNALS OF PROBABILITY
INGRAM OLKIN, 1972-1973             RONALD PYKE, 1972-1975
I. R. SAVAGE, 1974-1976             PATRICK BILLINGSLEY, 1976-1978
RUPERT G. MILLER, JR., 1977-1979    R. M. DUDLEY, 1979-1981
DAVID V. HINKLEY, 1980-1982         HARRY KESTEN, 1982-1984
MICHAEL D. PERLMAN, 1983-1985





EDITORIAL POLICY 


The main purpose of The Annals of Statistics and The Annals of Probability is to publish
significant contributions to the theory of statistics and probability and their applications. The
emphasis is on importance and interest; formal novelty and mathematical correctness alone are
not sufficient for publication. Especially appropriate are authoritative expository papers and
surveys of areas in vigorous development. Because statistics is an evolving discipline, the Editors
of The Annals of Statistics take a broad view of its domain and welcome papers in interface areas.
Contributors to The Annals of Statistics should review the editorial in the January 1980 issue. All
papers are refereed.


NOTICE 


Manuscripts submitted for The Annals of Statistics should be sent to the Editor at the
following address:

Willem R. van Zwet
Department of Mathematics
University of Leiden
P.O. Box 9512
2300 RA Leiden
The Netherlands


IMS CORPORATE MEMBERS 





THE AEROSPACE CORPORATION
Los Angeles, California

[illegible entry]

BELL COMMUNICATIONS RESEARCH
Morristown, New Jersey

GENERAL MOTORS CORPORATION
Research Laboratories
Warren, Michigan

INTERNATIONAL BUSINESS MACHINES CORP
Thomas J. Watson Research Center
Yorktown Heights, New York

SPRINGER-VERLAG NEW YORK INCORPORATED
New York, New York

UNION OIL COMPANY OF CALIFORNIA
Brea, California


IMS INSTITUTIONAL MEMBERS 





ARIZONA STATE UNIVERSITY
Dept of Mathematics
Tempe, Arizona

AUSTRALIAN NATIONAL UNIVERSITY
Canberra, ACT, Australia

BOWLING GREEN STATE UNIVERSITY
Dept of Mathematics and Statistics
Bowling Green, Ohio

CALIFORNIA STATE UNIVERSITY
AT FULLERTON
Depts of Math and Management Science
Fullerton, California

CASE WESTERN RESERVE UNIVERSITY
Dept of Mathematics and Statistics
Cleveland, Ohio

CENTERS FOR DISEASE CONTROL
Atlanta, Georgia

CENTRE INTERNATIONAL DE
RENCONTRE MATH
Marseille, France

CORNELL UNIVERSITY
Dept of Mathematics
Ithaca, New York

FLORIDA STATE UNIVERSITY
Dept of Statistics
Tallahassee, Florida

GEORGE WASHINGTON UNIVERSITY
Dept of Statistics
Washington, DC

INDIANA UNIVERSITY
Dept of Mathematics
Bloomington, Indiana

IOWA STATE UNIVERSITY
Dept of Stat and Statistical Lab
Ames, Iowa

JOHNS HOPKINS UNIVERSITY
Dept of Biostatistics
Baltimore, Maryland

KANSAS STATE UNIVERSITY
Dept of Statistics
Manhattan, Kansas

MARA INSTITUTE OF TECHNOLOGY
Selangor, West Malaysia

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Dept of Mathematics
Cambridge, Massachusetts

MIAMI UNIVERSITY
Dept of Mathematics and Statistics
Oxford, Ohio

MICHIGAN STATE UNIVERSITY
Dept of Statistics and Probability
East Lansing, Michigan

NATIONAL SECURITY AGENCY
Fort George G Meade, Maryland

NEW MEXICO STATE UNIVERSITY
Dept of Mathematical Sciences
Las Cruces, New Mexico

NORTH CAROLINA STATE UNIVERSITY
Dept of Statistics
Raleigh, North Carolina

NORTHERN ILLINOIS UNIVERSITY
Dept of Mathematical Sciences
DeKalb, Illinois

NORTHWESTERN UNIVERSITY
Dept of Mathematics
Evanston, Illinois

OHIO STATE UNIVERSITY
Dept of Statistics
Columbus, Ohio

OREGON STATE UNIVERSITY
Dept of Statistics
Corvallis, Oregon

PENNSYLVANIA STATE UNIVERSITY
Dept of Statistics
University Park, Pennsylvania

PRINCETON UNIVERSITY LIBRARY
Princeton, New Jersey

PURDUE UNIVERSITY LIBRARY
West Lafayette, Indiana

QUEEN’S UNIVERSITY
Dept of Mathematics and Statistics
Kingston, Ontario, Canada

RICE UNIVERSITY
Dept of Mathematical Sciences
Houston, Texas

THE ROCKEFELLER UNIVERSITY LIBRARY
New York, New York

SIMON FRASER UNIVERSITY
Dept of Mathematics and Statistics
Burnaby, British Columbia, Canada

SOUTHERN ILLINOIS UNIVERSITY
Dept of Math, Stat, and Comp Sci
Edwardsville, Illinois

SOUTHERN METHODIST UNIVERSITY
Dept of Statistics
Dallas, Texas


STANFORD UNIVERSITY 
Dept of Statistics 
Stanford, California 


TEMPLE UNIVERSITY 
Dept of Mathematics 
Philadelphia, Pennsylvania 


TEXAS TECH UNIVERSITY
Dept of Mathematics 
Lubbock, Texas 


UNIVERSITY OF ARIZONA 
Dept of Mathematics 
Tucson, Arizona 


UNIVERSITY OF BRITISH COLUMBIA 
Dept of Statistics
Vancouver, British Columbia, Canada


UNIVERSITY OF CALGARY
Division of Statistics
Calgary, Alberta, Canada


UNIVERSITY OF CALIFORNIA
Dept of Statistics
Berkeley, California


UNIVERSITY OF CALIFORNIA
Division of Statistics
Davis, California


UNIVERSITY OF FLORIDA 
Dept of Statistics 
Gainesville, Florida 


UNIVERSITY OF GUELPH 
Dept of Mathematics and Statistics 
Guelph, Ontario, Canada 


UNIVERSITY OF ILLINOIS 
Department of Statistics
Urbana, Illinois


UNIVERSITY OF ILLINOIS AT CHICAGO
Dept of Math, Stat, and Comp Sci
Chicago, Illinois


UNIVERSITY OF IOWA
Dept of Statistics and Actuarial Sci
Iowa City, Iowa


UNIVERSITY OF MANITOBA
Dept of Statistics
Winnipeg, Manitoba, Canada 


UNIVERSITY OF MARYLAND 
Dept of Mathematics 
College Park, Maryland 


UNIVERSITY OF MASSACHUSETTS 
Dept of Mathematics and Statistics 
Amherst, Massachusetts 


UNIVERSITY OF MICHIGAN 
Dept of Statistics 
Ann Arbor, Michigan 


UNIVERSITY OF MINNESOTA 
School of Statistics
Minneapolis, Minnesota


UNIVERSITY OF MISSOURI AT COLUMBIA
Dept of Statistics
Columbia, Missouri 


UNIVERSITY OF MONTREAL 
Dept of Mathematics 
Montreal, Quebec, Canada 


UNIVERSITY OF NEBRASKA 
Dept of Mathematics and Statistics 
Lincoln, Nebraska 


UNIVERSITY OF NEW MEXICO
Dept of Mathematics and Statistics 
Albuquerque, New Mexico 


UNIVERSITY OF NORTH CAROLINA 
Dept of Statistics 
Chapel Hill, North Carolina 


UNIVERSITY OF OREGON 
Dept of Mathematics
Eugene, Oregon


UNIVERSITY OF OTTAWA
Dept of Mathematics
Ottawa, Ontario, Canada 


UNIVERSITY OF SOUTH CAROLINA 
Dept of Statistics 
Columbia, South Carolina 


UNIVERSITY OF STOCKHOLM 
Inst of Actuarial Math and Math Stat 
Stockholm, Sweden 


UNIVERSITY OF TEXAS AT AUSTIN 
Dept of Mathematics 
Austin, Texas 


UNIVERSITY OF TEXAS AT SAN ANTONIO 
Div of Math, Comp Sci, and Systems Design
San Antonio, Texas


UNIVERSITY OF VICTORIA
Dept of Mathematics
Victoria, British Columbia, Canada


UNIVERSITY OF VIRGINIA 
Dept of Mathematics 
Charlottesville, Virginia 


UNIVERSITY OF WASHINGTON 
Depts of Statistics and Mathematics 
Seattle, Washington 


UNIVERSITY OF WATERLOO 
Dept of Statistics and Actuarial Sci 
Waterloo, Ontario, Canada 


UNIVERSITY OF WISCONSIN AT MILWAUKEE 
Dept of Mathematical Sciences 
Milwaukee, Wisconsin 


VIRGINIA COMMONWEALTH UNIVERSITY 
Dept of Mathematical Sciences 
Richmond, Virginia 


VIRGINIA POLYTECHNIC INSTITUTE 
AND STATE UNIVERSITY 

Dept of Statistics 

Blacksburg, Virginia 


WAYNE STATE UNIVERSITY 
Dept of Mathematics 
Detroit, Michigan 

YORK UNIVERSITY 


Dept of Mathematics 
Downsview, Ontario, Canada 


THE ANNALS OF STATISTICS 


INSTRUCTIONS FOR AUTHORS 


Submission of Papers. Papers to be submitted for
publication should be sent to the Editor of The Annals
of Statistics. (For current address, see the latest issue
of the Annals.) Four copies should be submitted on
paper that will take ink corrections. The manuscript
will not normally be returned to the author; when
expressly requested by the author, one copy of the
manuscript will be returned. All manuscripts should be
accompanied by a cover letter.


Preparation of Manuscripts. Manuscripts should be
typewritten, entirely double-spaced, including references,
with wide margins at sides, top and bottom. All
copies must be completely legible. When technical
reports are submitted, all extraneous sheets and covers
must be removed. Typists should check an issue of the
Annals for style.


Submission of Reference Papers. Four copies of 
unpublished or not easily available papers cited in the 
manuscript should be submitted with the manuscript. 


Title. The title should be descriptive and as concise as
is feasible, i.e., it should indicate the topic of the paper
as clearly as possible, but every word in it should be
pertinent.


Abbreviated Title. An abbreviated title to be used as
a running head is also required. This should normally
not exceed 35 characters. For example, an article with
the title “The Curvature of a Statistical Model, with
Applications to Large-Sample Likelihood Methods,”
could have the running head, “Curvature of Statistical
Model” or possibly “Asymptotics of Likelihood Methods,”
depending on the emphasis to be conveyed.


Affiliation. Indicate your present institutional affiliation
as you would like it to appear.


Summary. Each manuscript is required to contain a 
summary, clearly separated from the rest of the paper, 
which will be printed immediately after the title. Its 
main purpose is to inform the reader quickly of the 
nature and results of the paper; it may also be used as 
an aid in retrieving information. The length of a
summary will clearly depend on the length and difficulty
of the paper, but in general it should not exceed
150 words. Formulas should be used as sparingly as 
possible within the summary. The summary should 
not make reference to results or formulas in the body 
of the paper—it should be self-contained. 


Footnotes. Footnotes should not be used, except as 
described under Title Page Footnotes below. Such 
information should be included within the text. 


Title Page Footnotes. Included as a footnote on page 
1 should be the headings: 


American Mathematical Society 1980 subject clas- 
sifications. Primary—; secondary—. 
Key words and phrases. 


The classification numbers representing the primary
and secondary subjects of the article may be
found with instructions for its use in the Mathematical
Reviews Annual Subject Index-1980. The key words
and phrases should describe the subject matter of the
article; generally they should be taken from the body
of the paper.


Acknowledgment of support. Grants and contracts
should also be included in this footnote.


Identification of Symbols. Manuscripts for publication
should be clearly prepared to insure that all
symbols are properly identified. Distinguish between
“oh” and “zero”; “ell” and “one”; “epsilon” and “element
of”; “summation” and “capital sigma,” etc. Indicate
also when special type is required (Greek, German,
script, boldface, etc.); unless indicated otherwise,
formula letters will be set in italics. Acronyms should
be introduced sparingly. Any handwritten symbols
should be clearly identified.


Figures and Tables. Figures, charts, and diagrams
should be prepared in a form suitable for photographic
reproduction and should be professionally drawn twice
the size they are to be printed. (These need not be
submitted until the paper has been accepted for publication.)
The printer does not improve upon the quality
of the figures submitted. Tables should be typed on
separate pages with accompanying footnotes immediately
below the table.


Formulas. Fractions in the text are preferably written
with the solidus or negative exponent; thus,
$(a + b)/(c + d)$ is preferred to $\frac{a + b}{c + d}$, and $(2\pi)^{-1}$
or $1/(2\pi)$ to $\frac{1}{2\pi}$. Also, $a^{b(c)}$ and $a_{b(c)}$ are preferred
to $a^{b_c}$ and $a_{b_c}$, respectively. Complicated exponentials
should be represented with the symbol exp. A fractional
exponent is preferable to a radical sign.


References. References should be typed double-spaced 
and should follow the style: 


Kiefer, J. C. (1976). Admissibility of conditional con- 
fidence procedures. Ann. Statist. 4 836-865. 


In textual material, the format “... Kiefer
(1976) ...” should be used. Multiple references can be
distinguished as “... Kiefer (1976a) ...”. Abbreviations
for journals should be taken from a current index issue
of Mathematical Reviews.


Addresses. The permanent address of each author
should be typed following the references.


Galley Proofs. The author will ordinarily receive galley
proofs. Corrected galley proofs should be sent to
AOS Redactory, Science Typographers, Inc., 15
Industrial Boulevard, Medford, NY 11763.


Correspondence. All correspondence with the Editor
must refer to the manuscript number of the paper. This
number will be on the card sent to the author acknowledging
receipt of the article.


The Annals of Statistics
1986, Vol. 14, No. 3, 781-818


INVITED PAPER 
INFLUENCE FUNCTIONALS FOR TIME SERIES¹


By R. DOUGLAS MARTIN AND VICTOR J. YOHAI 


University of Washington and University of Buenos Aires 
and CEMA, Buenos Aires 


A definition is given for influence functionals of parameter estimates
in time-series models. The definition involves the use of a contaminated
observations process of the form $y_i^\gamma = (1 - z_i^\gamma) x_i + z_i^\gamma w_i$, $i = 1, 2, \ldots$,
$0 \le \gamma \le 1$, where $x_i$ is a core process (usually Gaussian), $w_i$ is a contaminating
process, and $z_i^\gamma$ is a zero-one process with $P(z_i^\gamma = 1) = \gamma + o(\gamma)$. This
form is sufficiently general to model such diverse contamination types as
isolated outliers and patches of outliers. Let $T(\mu_y^\gamma)$ denote the functional
representation of a given estimate, where the measures $\mu_y^\gamma$, $0 \le \gamma \le 1$, for $y_i^\gamma$
are in an appropriate subset of the family of stationary and ergodic measures
on $(R^\infty, \mathcal{B}^\infty)$. The influence functional IF is a derivative of $T$ along “arcs”
traced by $\mu_y^\gamma$ as $\gamma \to 0$, and correspondingly $\mu_y^\gamma \to \mu_x$. Although this influence
functional is similar in spirit to Hampel’s influence curve ICH for the i.i.d.
setting, it is not the same as ICH. However, a simple relationship between the
IF and the ICH is established. Results are given which aid in the computation
of IF and insure that IF is bounded. We compute the IF for some robust
estimates of the first-order autoregressive and first-order moving average
parameters using various contamination processes. A definition of gross-error
sensitivity (GES) for the IF is given, and some estimates are compared in
terms of their GES’s. Also the IF is used to show that a class of generalized
RA estimates has a certain optimality property. Finally, some possible
generalizations of the IF are indicated.


1. Introduction. The influence curve, introduced by Hampel (1974), has 
been referred to as “perhaps the most useful heuristic tool of robust statistics” by 
Huber (1981) in Chapter 1.5 of his recent book. Indeed the usefulness of the 
influence curve in situations where the data consist of independent and identically
distributed (i.i.d.) random variables or random vectors is reflected by its
appearance in many papers on robustness, and by attempts to extend its 
definition to cover situations other than the standard point estimation problems. 
For example one finds recent papers on influence curves in the context of errors in 
variables (Kelly, 1984), quantal bioassay (James and James, 1983), problems 
involving censoring (Samuels, 1978), and for parameter testing (Lambert, 1981; 
Ronchetti, 1982), and goodness of fit tests (Michael and Schucany, 1985). 

In spite of the pervasive nature of the influence curve and the length of time 
elapsed since Hampel’s initial contribution, a completely satisfactory definition of 


Received July 1984; revised April 1986. 

¹Research supported by the Office of Naval Research under contracts N00014-82-K-0062 and
N00014-84-C-0169. 

AMS 1980 subject classifications. Primary 62G35; secondary 62M10, 62F10. 

Key words and phrases. Influence functionals, influence curves, time series, robust estimates. 






influence curve for the time-series setting has not yet been given. A number of 
authors have suggested carrying over Hampel’s definition of influence curve to 
the time-series setting: Martin and Jong (1976), Portnoy (1977), and Martin 
(1980) mention this possibility, while Künsch (1984) has pursued the use of 
Hampel’s influence curve for obtaining infinitesimal optimality results for autore- 
gression estimates. See also Chernick, Downing, and Pike (1982). However, we 
shall argue that while Hampel’s influence curve plays a central role in reflecting 
the “influence” of contamination in the i.i.d. setting, it does not adequately
capture the nature of contamination effects in the time-series setting. We at- 
tempt to remedy the situation by providing a useful definition of influence 
functional (IF) for time-series parameter estimation problems which seems natu- 
ral and closely coupled to intuition. 

One of the chief features of the time-series setting is the fact that estimators 
which take account of the time-series structure are not invariant under permuta- 
tions of the data, as in the case of estimators for i.i.d. situations. Consequently, 
basic permutation dependent issues of contamination, such as the distinction 
between outliers which occur in isolation versus outliers which occur in patches, 
become important. Such distinct types of behavior are common occurrences in 
real data, as any careful and experienced practitioner knows all too well. Our 
definition of influence functional reflects the difference in impact of these two 
types of behavior, as well it should. These differences are clearly revealed in some 
explicit computations of influence functionals. 

As a point of departure we briefly recall Hampel’s (1974) definition of influence
curve and its properties. The context is that of possibly vector-valued independent
observations $y_1, \ldots, y_n$ with common distribution $F$, and an estimator
$T_n = T_n(y_1, \ldots, y_n)$ which may also be vector-valued. It is assumed that $T_n$ may
be obtained from a functional $T = T(F)$ defined on a suitably rich family of
distributions by evaluating $T$ at the empirical distribution function $F_n$: $T_n = T(F_n)$.
Let $F_\gamma = (1 - \gamma)F + \gamma\delta_y$ be a contamination distribution, where $\delta_y$ has all
its mass at $y$. Then Hampel’s influence curve is the directional or Gateaux
derivative at $F$ of the functional $T$, in the “direction” determined by $\delta_y$:

(1.1)  $\mathrm{ICH}(y) = \mathrm{ICH}(y; T, F) = \lim_{\gamma \to 0} \frac{T(F_\gamma) - T(F)}{\gamma}.$

The influence curve is both an asymptotic and a local (or infinitesimal) tool. 
The influence curve has several useful properties (Hampel, 1974): 


(P1) an appealing heuristic interpretation; 

(P2) a convenient role in formal asymptotics; 

(P3) an indicator via gross-error sensitivity (GES) of maximum bias due to 
infinitesimal contamination; 

(P4) the construction of optimal estimates under the constraint of a bounded 
gross-error sensitivity (GES). 


Results on (P4) for ordinary regression may be found in Hampel (1975, 1978), 
Krasker and Welsch (1982), and Huber (1983). Similar results based on ICH for 
autoregression have been obtained by Künsch (1984).
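
Definition (1.1) can be checked numerically for simple functionals. The following is only an illustrative sketch, not part of the original development: it assumes a discrete distribution $F$ represented by atoms and weights, and it replaces the limit in (1.1) by a finite difference with a small $\gamma$. The helper names (`contaminate`, `ich`) are hypothetical. For the mean functional the result should be close to $y - E_F(y)$, and for the variance functional close to $(y - E_F(y))^2 - \mathrm{Var}_F(y)$.

```python
def contaminate(atoms, weights, y, gamma):
    """Atoms and weights of F_gamma = (1 - gamma) * F + gamma * delta_y."""
    new_atoms = list(atoms) + [y]
    new_weights = [(1.0 - gamma) * w for w in weights] + [gamma]
    return new_atoms, new_weights


def mean_functional(atoms, weights):
    """T(F) = mean of a discrete distribution."""
    return sum(a * w for a, w in zip(atoms, weights))


def variance_functional(atoms, weights):
    """T(F) = variance of a discrete distribution."""
    m = mean_functional(atoms, weights)
    return sum(w * (a - m) ** 2 for a, w in zip(atoms, weights))


def ich(functional, atoms, weights, y, gamma=1e-6):
    """Finite-difference stand-in for the limit in (1.1); gamma is an assumed step size."""
    base = functional(atoms, weights)
    shifted = functional(*contaminate(atoms, weights, y, gamma))
    return (shifted - base) / gamma


# A small discrete F (illustrative values only) and a contamination point y.
atoms = [-1.0, 0.0, 1.0, 2.0]
weights = [0.25, 0.25, 0.25, 0.25]
y = 10.0

ich_mean = ich(mean_functional, atoms, weights, y)
ich_var = ich(variance_functional, atoms, weights, y)
print(ich_mean, ich_var)
```

Both values grow without bound in $y$, which is the usual way of seeing that the mean and variance have unbounded gross-error sensitivity.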




The remaining parts of the paper are as follows: As preliminaries, Section 2 
introduces a general class of contamination processes which is useful for our 
definition of influence functional (IF), along with several particular types of 
contamination, and, also, some notation used in the remainder of the paper. 
Section 3 discusses functional representations for time-series parameter estimates,
introduces $\Psi$-estimates, the main class with which we work, and gives
some specific $\Psi$-estimates.

Section 4 introduces our definition of IF, gives results which aid in the 
computation of the IF, and which insure boundedness of the IF. Section 5 gives 
specific results for generalized M-estimates and RA estimates of first-order 
autoregressive and moving average models. The results given in Sections 4 and 5 
address Künsch’s (1984, Section 2.6) second open question in the context of our
definition of time series IF: namely, we include the case where the estimator 
depends upon the measure for the process (not just on finite-dimensional margi- 
nal measures). In particular, it is shown that although bounded psi functions (i.e., 
bounded summands in estimating equations) yield a bounded IF for AR(1) 
models, this condition is not sufficient to insure a bounded IF for MA(1) models. 
On the other hand, redescending psi functions can yield a bounded IF for MA(1) 
models. These results reveal a key distinction with regard to robustness between
models having moving average components and those which do not. 

Section 6 introduces a definition of gross-error sensitivity (GES), and com- 
putes GES’s for the estimates and models treated in Section 5. Section 7 
introduces a class of generalized RA-estimates and establishes, using the defini- 
tion of IF, a certain optimality of these estimates. Section 8 sketches some 
possible generalizations of the IF through applications to a white-noise test 
statistic, and to spectral density estimation. Finally, proofs of theorems are 
collected in Section 9. 

Throughout, we keep (P1)-(P4) of the ICH in mind, with a view toward 
preserving the most essential of these in the time-series setting. Our preference 
ranking for these properties makes (P1) and (P3) paramount, with (P4) highly 
desirable. 


2. Contamination processes for time series. 


2.1. The importance of outliers’ time configurations. In the case of estimates
which are intended for use in the i.i.d. setting, such as ordinary location
M-estimates, the influence curve may be defined asymptotically as in (1.1), with
Hampel’s (1974) attendant finite-sample size approximation, by placing all the
contamination at a single point. This approach works essentially because most
estimators intended for use with i.i.d. data are invariant under permutations of
the data, and may be obtained from functionals $T$ of the marginal distribution
function $F$ by evaluating $T$ at the empirical marginal distribution function $F_n$.
Under such conditions, the specific time configuration of observation times at
which the contaminating points occur is irrelevant.

By way of contrast, the time configuration of the contaminating points will be 
important in the case of estimates which make use of the time-series structure. 




For the sake of specific illustration consider the ordinary lag-one correlation
coefficient

    $\hat\rho = \sum_{t=1}^{n-1} y_t y_{t+1} \Big/ \sum_{t=1}^{n} y_t^2$

with estimation of the mean ignored for simplicity (the behavior to be described
is qualitatively the same when $\bar y$ is included). It is clear that the values of a fixed
number k of isolated outliers appear quadratically in the denominator of $\hat\rho$, but
only linearly in the numerator; by “isolated” we mean that each pair of outliers
is separated in time by at least one nonoutlier observation. If the outliers have a
common value or “amplitude” $\zeta$, then $\hat\rho \to 0$ as $\zeta \to \infty$, and the effect might be
described as bias toward zero. On the other hand, if the k outliers of common
amplitude $\zeta$ are contiguous, i.e., if they form a patch of length k, then
$\hat\rho \to (k-1)/k$ as $\zeta \to \infty$. For long patch lengths k, the effect might be described
as bias toward unity.
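The two limits can be checked numerically. The following is a minimal sketch, not part of the original computations: it assumes a simulated Gaussian white-noise series as the clean data, and the outlier positions, the sample size, and the helper names `lag_one_corr` and `contaminate` are all hypothetical choices made for illustration.

```python
import random

def lag_one_corr(y):
    """Lag-one correlation coefficient with the mean ignored, as in the text."""
    num = sum(y[t] * y[t + 1] for t in range(len(y) - 1))
    den = sum(v * v for v in y)
    return num / den

def contaminate(x, positions, zeta):
    """Replace the observations at the given positions by the constant zeta."""
    y = list(x)
    for t in positions:
        y[t] = zeta
    return y

rng = random.Random(0)
n, k, zeta = 200, 5, 1.0e6                    # hypothetical settings
x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # hypothetical clean series

isolated = [20, 60, 100, 140, 180]            # k outliers, no two adjacent
patch = list(range(100, 100 + k))             # k contiguous outliers

rho_isolated = lag_one_corr(contaminate(x, isolated, zeta))
rho_patch = lag_one_corr(contaminate(x, patch, zeta))
print(rho_isolated, rho_patch, (k - 1) / k)
```

For a large amplitude the isolated configuration should give a value near zero and the patch configuration a value near (k - 1)/k, in line with the two biases described above.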

Since different time configurations can have quite different impacts on an 
estimate, it will be natural when defining an influence functional for time series 
to work with contaminated processes which have the flexibility to provide 
different time configurations of contamination or outliers, as well as a controlled 
contamination fraction. 




2.2. The general replacement model. The following component processes and
associated stationary and ergodic marginal measures on $(R^\infty, \mathcal{B}^\infty)$ are used to
construct the contaminated process:

    $x_t \sim \mu_x$, the nominal or core process, often Gaussian,
    $w_t \sim \mu_w$, a contaminating process,
    $z_t^\gamma \sim \mu_z^\gamma$, a 0-1 process,

where $0 \le \gamma \le 1$, and

(2.1)    $P(z_t^\gamma = 1) = g(\gamma) = \gamma + o(\gamma)$

for some function g. The contaminated process $y_t^\gamma$ is now obtained by the general
replacement model

(2.2)    $y_t^\gamma = (1 - z_t^\gamma) x_t + z_t^\gamma w_t,$

where $y_t^\gamma \sim \mu_y^\gamma$ and $\mu_y^0 = \mu_x$, i.e., zero contamination results in perfect
observations of $x_t$. In general we may wish to allow dependence between the $z_t^\gamma$, $x_t$, and
$w_t$ processes in order to model certain kinds of outliers. Correspondingly, the
measures $\mu_y^\gamma$, $0 \le \gamma \le 1$, are determined by the specification of the joint measures
$\mu_{xwz}^\gamma$.


The pure replacement model. Here $z_t^\gamma$, $x_t$ and $w_t$ are mutually independent
processes, i.e.,

    $\mu_{xwz}^\gamma = \mu_x \mu_w \mu_z^\gamma.$




The AO model. Allowing dependence between $x_t$ and $w_t$ means that the
additive outliers (AO) model

(2.3)    $y_t^\gamma = x_t + v_t^\gamma$

used in previous studies (Denby and Martin, 1979; Bustos and Yohai, 1986) may
in some situations be obtained as a special case of (2.2). This is the case for
example when the $v_t^\gamma$ have, for marginal distribution, the contamination
distribution $F_\gamma = (1 - \gamma)\delta_0 + \gamma H$ with degenerate central component $\delta_0$. Just set
$w_t = x_t + v_t$ with $v_t$ having marginal distribution H, let $g(\gamma) = \gamma$ in (2.2), and let $z_t^\gamma$
be independent of $x_t$ and $v_t$. Here $\mu_{xwz}^\gamma = \mu_{xw}\mu_z^\gamma$. Throughout the
remainder of the paper we use the version of the AO model obtained from (2.2)
with $w_t = x_t + v_t$.

The two main time configurations for outliers are (a) isolated outliers, and (b) 
outliers occurring in patches or bursts. The need for modeling the latter behavior 
is well recognized by those who have dealt with real time-series problems. It may 
also be desirable to combine these two situations in order to adequately model 
some time-series data. 


Independent outliers. Since isolated outliers are typically produced by
independence in the $w_t$ or $v_t$, we shall use the terms “independent” and “isolated” as
interchangeable adjectives.

Situations in which the outliers are mainly isolated are easily manufactured
from either the pure replacement or AO form of (2.2) by letting $z_t^\gamma$ be i.i.d. with
$g(\gamma) = \gamma$ and $w_t$ an appropriately specified process. For example, $w_t$ could be an
i.i.d. Gaussian process with mean zero and suitably large variance, or the $w_t$ could
be identically equal to a constant value $\zeta$.


Patchy outliers. Patches of approximately fixed length can be arranged in the
following way. Let $w_t$ and $v_t$ be highly correlated processes. In case these
processes are identically equal to a constant $\zeta$, they will be regarded as highly
correlated. Now let $\tilde z_t^p$ be an i.i.d. binomial B(1, p) sequence, and set

(2.4)    $z_t^\gamma = 1$ if $\tilde z_{t-l}^p = 1$ for some $l = 0, 1, \ldots, k-1$, and $z_t^\gamma = 0$ otherwise.

Here we set $\gamma = kp$, with k fixed and p variable. Then since

    $P(z_t^\gamma = 1) = 1 - (1 - p)^k = kp + o(p)$

we have

(2.5)    $g(\gamma) = \gamma + o(\gamma)$

and the average patch length is k for $\gamma$ small. We denote the probability measure
of the process $\{z_t^\gamma\}$ by $\mu_z^{k,\gamma}$.
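The patch construction and the replacement model (2.2) are easy to simulate. The sketch below is illustrative only: the sample size, the values of k and p, the constant additive outlier of size 10, and the function names `patch_indicator` and `replacement_model` are assumptions made here and do not come from the paper.

```python
import random

def patch_indicator(n, k, p, rng):
    """0-1 process of (2.4): z_t = 1 if any of the k most recent Bernoulli(p)
    start indicators equals 1."""
    # k - 1 extra start-up values make z_t well defined for every t.
    starts = [1 if rng.random() < p else 0 for _ in range(n + k - 1)]
    return [1 if any(starts[t:t + k]) else 0 for t in range(n)]

def replacement_model(x, w, z):
    """General replacement model (2.2): y_t = (1 - z_t) x_t + z_t w_t."""
    return [(1 - zt) * xt + zt * wt for xt, wt, zt in zip(x, w, z)]

rng = random.Random(1)
n, k, p = 200_000, 5, 0.002          # hypothetical settings
z = patch_indicator(n, k, p, rng)
fraction = sum(z) / n                # empirical contamination fraction
exact = 1.0 - (1.0 - p) ** k         # P(z_t = 1) as computed in the text
gamma = k * p

x = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [xt + 10.0 for xt in x]          # AO form: w_t = x_t + v_t with v_t = 10
y = replacement_model(x, w, z)
print(fraction, exact, gamma)
```

The empirical fraction should be close to 1 - (1 - p)^k, which in turn differs from gamma = kp only by a term of order p squared.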


2.3. Some notation. In the sequel we shall use the following notational
conventions. Finite sets of contiguous $x_t$ will be denoted

(2.6)    $\mathbf{x}_j^i = (x_i, x_{i-1}, \ldots, x_j), \quad j \le i,$

and similarly with $w_t$. We often need dummy arguments which are shadow
representatives for observations, and we use $y_t$’s for this purpose.
Correspondingly, finite segments of $y_t$’s are denoted

(2.7)    $\mathbf{y}_j^i = (y_i, y_{i-1}, \ldots, y_j), \quad j \le i.$

We will have little need to refer to finite segments of $y_t^\gamma$ and $z_t^\gamma$. However, we
do need semi-infinite sequences of $y_t^\gamma$’s (with measure $\mu_y^\gamma$) and $x_t$’s (with measure
$\mu_x$), as well as semi-infinite sequences of dummy arguments $y_t$. These we denote

(2.8)    $\mathbf{y}_t^\gamma = (y_t^\gamma, y_{t-1}^\gamma, \ldots),$
(2.9)    $\mathbf{x}_t = (x_t, x_{t-1}, \ldots),$
(2.10)   $\mathbf{y}_t = (y_t, y_{t-1}, \ldots),$

and so on. The use of $\mathbf{y}_t^\gamma$ and $\mathbf{x}_t$ is almost exclusively reserved for computations
involving expectations under $\mu_y^\gamma$ and $\mu_x$, respectively, and since these are
stationary measures, our typical arguments are $\mathbf{y}_1^\gamma$ and $\mathbf{x}_1$. Since $\mathbf{y}_t$ is usually a
dummy argument, we usually use $\mathbf{y}_1$.

Semi-infinite sequences such as (2.8)-(2.9) may be regarded as points in $R^\infty$.
Doubly infinite sequences such as $(\ldots, y_{-1}, y_0, y_1, y_2, \ldots)$ are points in the space
$R^{-\infty,\infty}$ of all doubly infinite sequences.


3. Time-series parameter estimates and functionals. It is assumed
throughout that the observations $y_t$ are realizations of a stationary and ergodic
process on $R^{-\infty,\infty}$, with associated probability space $(R^{-\infty,\infty}, \mathcal{B}, \mu)$, $\mathcal{B}$ being the
family of Borel sets in $R^{-\infty,\infty}$, with $\mu$ in the set $P_{se}$ of all stationary and ergodic
measures on $(R^{-\infty,\infty}, \mathcal{B})$. In this time-series setting, it is usually possible to
represent the asymptotic value of a parameter estimate as a functional, $T = T(\mu)$,
defined on a subset $P_0$ of $P_{se}$.

The basic definitions of influence functional and gross-error sensitivity for
time-series parameter estimates, which we give subsequently, are for quite
general functionals $T(\cdot)$. However, all specific ensuing results are for a special
class of functionals T associated with those time-series model parameter
estimates $T_n$ which may be computed as a solution to the estimating equation

(3.1)    $\sum_{i=1}^{n} \Psi_i(y_1, \ldots, y_i, T_n) = \sum_{i=1}^{n} \Psi_i(\mathbf{y}_1^i, T_n) = 0.$


Here each $\Psi_i$ is a function from $R^i \times R^m$ to $R^m$. Both $\Psi_i$ and $T_n$ may be
vector-valued, as when estimating the parameters of an autoregressive-moving-
average model of orders p and q. For the sake of notational simplicity we take
the $y_t$ to be scalar-valued, but all of what follows applies equally well to the case
of vector-valued time series.

The subscript i on $\Psi_i$ accounts for “end effects” which vanish either after a
finite number of observations (as in Example 1 to follow), or asymptotically (as in
Examples 2-4 to follow). In either case the asymptotic value $T = T(\mu)$ of $T_n$ can
usually be determined through the use of a fixed psi function $\Psi$ which for each t
satisfies

(3.2)    $\lim_{i \to \infty} \Psi_i(a_1, \ldots, a_i, t) = \Psi(\mathbf{a}, t), \quad \forall\, \mathbf{a} = (a_1, a_2, \ldots) \in R^\infty.$

A special case of such a $\Psi$ is one which depends only on a finite number of
coordinates: for each t

(3.2')   $\Psi_i(a_1, \ldots, a_i, t) = \Psi(\mathbf{a}, t), \quad i \ge k, \ \forall\, \mathbf{a} \in R^\infty.$

Example 1 to follow falls into this category, whereas Examples 2-4 require the
general form of $\Psi$.

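Under suitable regularity conditions, which include ergodicity, one expects to
have

    $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \Psi_i(\mathbf{y}_1^i, t) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \Psi(\mathbf{y}_i, t) = E\Psi(\mathbf{y}_1, t).$

Therefore we assume that the asymptotic value $T(\mu)$ is defined by

(3.3)    $\int \Psi(\mathbf{y}_1, T)\, d\mu(\mathbf{y}_1) = 0.$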


We shall assume that (3.3) either has a unique root $t_0 = T(\mu)$, or that a
well-defined solution is available in the case of multiple roots. T is then defined
on $P_0$ consisting of all $\mu$ in $P_{se}$ such that the integral in (3.3) exists and is finite.

An estimate $T_n$ defined by (3.1) is called a $\Psi$ estimate, and this term will also
be used to describe the associated asymptotic version T defined by (3.3). The class
of $\Psi$ estimates is quite large and contains both classical and robust parameter
estimates, as the examples to follow show.

Our examples consist of two classes of robust estimates of the parameters of
first-order autoregressive, AR(1), and first-order moving-average, MA(1), models.
The AR(1) model is

(3.4)    $x_t = \phi x_{t-1} + u_t$

and the MA(1) model is

(3.5)    $x_t = u_t - \theta u_{t-1},$

where the innovations $u_t$ are assumed to be i.i.d. with a common N(0,1)
distribution. The assumption of a known innovations scale $\sigma_u = 1$ and known
location $\mu = 0$ is made to simplify the exposition.


EXAMPLE 1. GM/BIF estimates for the AR(1) model. Here $T_n = \hat\phi$ is a
generalized M estimate (GM estimate) or bounded-influence estimate (BIF
estimate) of $\phi$. These estimates are $\Psi$ estimates, with $\Psi_i = \psi_i$ given by

(3.6)    $\psi_i(\mathbf{y}_1^i, \phi) = \eta\big(y_i - \phi y_{i-1},\ y_{i-1}(1 - \phi^2)^{1/2}\big), \quad i \ge 2,$

for some bounded function $\eta = \eta(\cdot, \cdot)$. Correspondingly, the limit $\Psi$ function is

(3.7)    $\psi(\mathbf{y}_1, \phi) = \eta\big(y_1 - \phi y_0,\ y_0(1 - \phi^2)^{1/2}\big).$

The two main variants of GM/BIF estimates are as follows (see Denby and
Martin, 1979; Martin, 1980; Bustos, 1982).




Mallows variant.

(3.8)    $\eta(\xi_1, \xi_2) = \psi(\xi_1)\psi(\xi_2)$

for some bounded robustifying psi function $\psi$. This type of estimate was
suggested by Mallows (1976) in the non-time-series regression setup.


Hampel-Krasker-Welsch (HKW) variant.

(3.9)    $\eta(\xi_1, \xi_2) = \psi(\xi_1 \xi_2)$

for some bounded robustifying function $\psi$.

The choice $\eta(\xi_1, \xi_2) = \psi(\xi_1)\xi_2$ yields the ordinary M estimate of $\phi$ (see for
example Martin, 1982). As we shall see, this estimate does not have a bounded
influence function, using either our definition or Hampel’s definition as extended
to the time series context (see Section 5). The ordinary M estimate and the
Mallows and HKW type estimates all reduce to the least-squares estimate, which
has an unbounded influence function by either definition, when $\psi$ is the identity
function.
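As an illustration of how a GM estimate of this kind might be computed, the following sketch solves the estimating equation built from the summands (3.6) by bisection. It is a hedged example rather than the authors' algorithm: the Huber psi function, the tuning constants (borrowed from Table 5.1), the search interval, the simulated clean AR(1) data, and all function names are assumptions made here.

```python
import math
import random

def huber_psi(u, a):
    """Huber psi function: identity on [-a, a], clipped outside."""
    return max(-a, min(a, u))

def eta_mallows(u1, u2, a=1.65):
    """Mallows variant (3.8): psi(u1) * psi(u2)."""
    return huber_psi(u1, a) * huber_psi(u2, a)

def eta_hkw(u1, u2, a=2.62):
    """HKW variant (3.9): psi(u1 * u2)."""
    return huber_psi(u1 * u2, a)

def gm_score(y, phi, eta):
    """Sum over i >= 2 of eta(y_i - phi y_{i-1}, y_{i-1} sqrt(1 - phi^2))."""
    scale = math.sqrt(1.0 - phi * phi)
    return sum(eta(y[i] - phi * y[i - 1], y[i - 1] * scale)
               for i in range(1, len(y)))

def gm_estimate(y, eta, lo=-0.99, hi=0.99, tol=1e-8):
    """Bisection for a root of the score; assumes a sign change on [lo, hi]."""
    f_lo = gm_score(y, lo, eta)
    if f_lo * gm_score(y, hi, eta) > 0:
        raise ValueError("no sign change on the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = gm_score(y, mid, eta)
        if f_lo * f_mid > 0:
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate_ar1(n, phi, rng):
    """Gaussian AR(1) path started from its stationary distribution."""
    x = [rng.gauss(0.0, 1.0 / math.sqrt(1.0 - phi * phi))]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x

rng = random.Random(2)
y = simulate_ar1(2000, 0.5, rng)      # hypothetical clean data, phi = 0.5
phi_mallows = gm_estimate(y, eta_mallows)
phi_hkw = gm_estimate(y, eta_hkw)
print(phi_mallows, phi_hkw)
```

Bisection only locates one root; with multiple roots a rule for selecting among them would be needed, as noted after (3.3).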


EXAMPLE 2. RA estimates for the AR(1) model. Recently Bustos and Yohai
(1986) have introduced a new class of estimates for ARMA models. These
estimates are called RA estimates because they are based on robust estimates of
residual autocovariances. For the AR(1) case the RA estimate $\hat\phi$ is defined as
follows. Let

(3.10)    $r_t(\phi) = y_t - \phi y_{t-1}$

and let

(3.11)    $\hat\gamma_l = \hat\gamma_l(\phi) = \frac{1}{n} \sum_{t=l+2}^{n} \eta\big(r_t(\phi), r_{t-l}(\phi)\big), \quad 0 \le l \le n-2,$

denote a robust lag-l autocovariance estimate for the residuals, with robustifying
function $\eta = \eta(\cdot, \cdot)$. Then $\hat\phi$ is a solution of the estimating equation

(3.12)    $\sum_{l=1}^{n-2} \hat\phi^{\,l-1} \hat\gamma_l(\hat\phi) = \frac{1}{n} \sum_{t=3}^{n} \sum_{l=1}^{t-2} \hat\phi^{\,l-1} \eta\big(r_t(\hat\phi), r_{t-l}(\hat\phi)\big) = 0.$

Estimates obtained for the choices $\eta(\xi_1, \xi_2) = \psi(\xi_1)\psi(\xi_2)$ and $\eta(\xi_1, \xi_2) =
\psi(\xi_1 \xi_2)$ are again called Mallows and HKW estimates, respectively. It may be
shown that if $\eta(\xi_1, \xi_2) = \xi_1 \xi_2$, then $\hat\phi$ is asymptotically equivalent to the
least-squares estimate, and if $\eta(\xi_1, \xi_2) = \psi(\xi_1)\xi_2$, then $\hat\phi$ is asymptotically
equivalent to an M estimate.

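Again $\Psi_i = \psi_i$ is scalar-valued, and $\psi_i(\mathbf{y}_1^i, \phi)$ is given by the inner summation
in (3.12), with $\hat\phi$ replaced by $\phi$, and the limit $\Psi$ function is

(3.13)    $\psi(\mathbf{y}_1, \phi) = \sum_{j=1}^{\infty} \phi^{j-1} \eta\big(y_1 - \phi y_0,\ y_{1-j} - \phi y_{-j}\big).$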
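The RA estimating equation can be evaluated directly as a double sum over residual pairs weighted by powers of the parameter. The sketch below is an illustration under stated assumptions, not the authors' code: it takes the equation in the form "sum over t from 3 to n and l from 1 to t - 2 of phi^(l-1) eta(r_t, r_(t-l)) equals zero", uses the product eta(u1, u2) = u1 u2 (for which the text states asymptotic equivalence with least squares), and relies on hypothetical names, simulated data, and a bisection search interval chosen here.

```python
import random

def ra_score(y, phi, eta):
    """Double sum of the RA estimating equation, with 1-based time index t."""
    n = len(y)
    # r[t] holds the residual y_t - phi y_{t-1}; r[0] and r[1] are unused.
    r = [0.0, 0.0] + [y[t - 1] - phi * y[t - 2] for t in range(2, n + 1)]
    total = 0.0
    for t in range(3, n + 1):
        weight = 1.0                      # phi ** (l - 1), updated recursively
        for l in range(1, t - 1):         # l = 1, ..., t - 2
            total += weight * eta(r[t], r[t - l])
            weight *= phi
    return total / n

def bisect_root(f, lo, hi, tol=1e-6):
    """Bisection; assumes f changes sign on [lo, hi]."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("no sign change on the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid > 0:
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ls_estimate(y):
    """Least-squares estimate of phi from the regression of y_t on y_{t-1}."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

rng = random.Random(3)
phi0, n = 0.5, 300                        # hypothetical settings
y = [rng.gauss(0.0, 1.0)]
for _ in range(n - 1):
    y.append(phi0 * y[-1] + rng.gauss(0.0, 1.0))

def product(u1, u2):
    return u1 * u2

phi_ra = bisect_root(lambda phi: ra_score(y, phi, product), -0.95, 0.95)
phi_ls = ls_estimate(y)
print(phi_ra, phi_ls)
```

With the product choice the two estimates should differ only through end effects; a bounded eta of Mallows or HKW type can be passed to `ra_score` in the same way.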


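EXAMPLE 3. GM estimates for the MA(1) model. In order to motivate the
definition of GM estimate for the MA(1) model, we first note that the least-squares
estimate $\hat\theta_{LS}$ of $\theta$ is a solution of the equation

(3.14)    $\sum_{t=2}^{n} r_t^*(\theta)\, s_{t-1}^*(\theta) = 0,$

where

(3.15)    $s_t^*(\theta) = \sum_{j=0}^{t-1} (j+1)\theta^j y_{t-j}$

and

(3.16)    $r_t^*(\theta) = \sum_{j=0}^{t-1} \theta^j y_{t-j}.$

It is easy to verify that when $y_t = x_t$, with $x_t$ the MA(1) model (3.5),
$\lim_{t \to \infty} \mathrm{var}\, r_t^*(\theta) = 1$ and $\lim_{t \to \infty} \mathrm{var}\, s_t^*(\theta) = 1/(1 - \theta^2)$. Therefore, by analogy
with the AR(1) model, we define the GM estimate $\hat\theta$ of $\theta$ by

(3.17)    $\sum_{t=2}^{n} \eta\big(r_t^*(\hat\theta),\ s_{t-1}^*(\hat\theta)(1 - \hat\theta^2)^{1/2}\big) = 0.$

Here $\psi_i(\mathbf{y}_1^i, \theta)$ is given by the ith summand above, with $s_{i-1}^*(\theta)$ and $r_i^*(\theta)$
expressed in terms of the $y_j$ for $1 \le j \le i$. The limit $\Psi$ function is

(3.18)    $\psi(\mathbf{y}_1, \theta) = \eta\big(r_1(\theta),\ s_0(\theta)(1 - \theta^2)^{1/2}\big),$

where

(3.19)    $s_t(\theta) = \sum_{j=0}^{\infty} (j+1)\theta^j y_{t-j}, \qquad r_t(\theta) = \sum_{j=0}^{\infty} \theta^j y_{t-j}.$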
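The finite sums defining the starred residual and derivative processes, r*_t(theta) as the sum over j from 0 to t - 1 of theta^j y_(t-j) and s*_t(theta) as the same sum with weights (j + 1) theta^j, need not be recomputed from scratch at each t. The sketch below uses two one-step recursions that are not stated in the text but follow from those definitions: r*_t = y_t + theta r*_(t-1) and s*_t = r*_t + theta s*_(t-1), both started at zero. Function names and the test data are hypothetical.

```python
def ma1_filters(y, theta):
    """Return lists r, s with r[t-1] = r_t^*(theta) and s[t-1] = s_t^*(theta)."""
    r, s = [], []
    r_prev = s_prev = 0.0
    for yt in y:
        r_t = yt + theta * r_prev      # r_t^* = y_t + theta * r_{t-1}^*
        s_t = r_t + theta * s_prev     # s_t^* = r_t^* + theta * s_{t-1}^*
        r.append(r_t)
        s.append(s_t)
        r_prev, s_prev = r_t, s_t
    return r, s

def direct_sums(y, theta, t):
    """Evaluate the two defining sums literally for 1-based time t."""
    r = sum(theta ** j * y[t - 1 - j] for j in range(t))
    s = sum((j + 1) * theta ** j * y[t - 1 - j] for j in range(t))
    return r, s

y_demo = [1.0, -2.0, 0.5, 3.0, -1.5]   # hypothetical observations
theta_demo = -0.5
r_rec, s_rec = ma1_filters(y_demo, theta_demo)
print(r_rec[-1], s_rec[-1], direct_sums(y_demo, theta_demo, len(y_demo)))
```

The recursive and direct evaluations should agree up to rounding, and the recursion costs O(n) rather than O(n squared) operations for a series of length n.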


EXAMPLE 4. RA estimates for the MA(1) model. The RA estimates $\hat\theta$ for
the MA(1) model have exactly the same form of estimating equations as in the
AR(1) case:

(3.20)    $\sum_{t=3}^{n} \sum_{l=1}^{t-2} \hat\theta^{\,l-1} \eta\big(r_t(\hat\theta), r_{t-l}(\hat\theta)\big) = 0$

with the residuals given by

(3.21)    $r_t(\theta) = \sum_{j=0}^{t-1} \theta^j y_{t-j}.$

It can be shown that when $\eta(\xi_1, \xi_2) = \xi_1 \xi_2$, $\hat\theta$ is asymptotically equivalent to the
least-squares estimate.

The function $\psi_i(\mathbf{y}_1^i, \theta)$ is given by the inner summation of the estimating
equation (3.20), with $\hat\theta$ replaced by $\theta$. The limit $\Psi$ function is

(3.22)    $\psi(\mathbf{y}_1, \theta) = \sum_{j=1}^{\infty} \theta^{j-1} \eta\Big(\sum_{k=0}^{\infty} \theta^k y_{1-k},\ \sum_{k=0}^{\infty} \theta^k y_{1-j-k}\Big).$


4. Influence functionals for time series. 


4.1. ICH for time series and the need for a new definition. Since we deal
only with estimates which are defined asymptotically by functionals $T = T(\mu)$, it
might at first blush be tempting to simply apply Hampel’s definition (1.1) in the
time-series parameter-estimation setting. To do so, one would replace the
univariate contaminating distribution $F_\gamma$ by the process contamination measure
$\mu_\gamma = (1 - \gamma)\mu + \gamma\delta_{\mathbf{y}}$, where in general $\mathbf{y} = (\ldots, y_{-1}, y_0, y_1, \ldots) \in R^{-\infty,\infty}$, $\delta_{\mathbf{y}}$ is
the unit mass at $\mathbf{y}$, and $\mu$ is a measure in $P_{se}$. We assume that a $\Psi$ estimate
$T = T(\mu)$ is defined by (3.3) not only for stationary and ergodic measures
$\mu_x$ for the core process $x_t$ in (2.2), but also for the contamination measures
$\mu_\gamma = (1 - \gamma)\mu_x + \gamma\delta_{\mathbf{y}}$, $0 \le \gamma \le 1$. Since $\delta_{\mathbf{y}} \in P_{se}$, this places some restriction
on $\mathbf{y}$.


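DEFINITION 4.1 (ICH for time-series $\Psi$ estimates). We define

(4.1)    $\mathrm{ICH}(\mathbf{y}, T, \mu) = \lim_{\gamma \to 0} \frac{T(\mu_\gamma) - T(\mu)}{\gamma}$

provided the limit exists.


Under suitable regularity conditions (4.1) and (3.3) yield

(4.2)    $\mathrm{ICH}(\mathbf{y}_1) = \mathrm{ICH}(\mathbf{y}_1, T, \mu_x) = -C^{-1}\Psi(\mathbf{y}_1, t_0),$

where $t_0 = T(\mu_x)$ and the nonsingular matrix C is given by

(4.2')   $C = [(\partial/\partial t) E\Psi(\mathbf{x}_1, t)]_{t = t_0}.$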

This possibility is most tempting when T depends only on a finite-dimensional
marginal measure $\mu^k$, as in the case of GM/BIF estimates for autoregression,
where the analogy with ordinary regression is suggestive. For such cases one
would set $\mu_\gamma = \mu_\gamma^k = (1 - \gamma)\mu^k + \gamma\delta_{\mathbf{y}^k}$, where $\delta_{\mathbf{y}^k}$ has all its mass at the
finite segment $\mathbf{y}^k$.
Such a definition was suggested by Martin and Jong (1977), Portnoy (1977), and
Martin (1980), and pursued in a more serious vein by Künsch (1984), who focused
on Hampel-type optimality results based on ICH. Künsch in fact proved that
(4.2)-(4.2') holds for pth-order autoregressions, and in that context also provided
an empirical interpretation in terms of adding a single observation at the
end of the series.

The ICH (4.1)-(4.2) does typically give the correct asymptotic
variance-covariance matrix for $T_n$:

(4.3)    $V = V_0 + \sum_{l=1}^{\infty} (V_l + V_l'),$

where

(4.4)    $V_l = \mathrm{cov}[\mathrm{ICH}(\mathbf{y}_1), \mathrm{ICH}(\mathbf{y}_{1+l})].$

In the case of estimating location with an ordinary M estimate V reduces to the
expression obtained by Portnoy (1977). In the case of autoregression GM/BIF
estimates, V coincides with the expression stated in Martin (1980) for the
Mallows variant, and established rigorously by Bustos (1982) for a general class
which includes both Mallows- and HKW-type estimates.

Unfortunately there is a very basic sense in which ICH is not the most
appropriate definition for the time-series context: the definition does not
correspond to any interesting contaminated time series. The reason is simple enough.
One computes ICH by letting $\mu_\gamma = (1 - \gamma)\mu + \gamma\delta_{\mathbf{y}}$, where $\delta_{\mathbf{y}}$ on $(R^{-\infty,\infty}, \mathcal{B})$
puts all its mass at the point $\mathbf{y} \in R^{-\infty,\infty}$. But the mixture structure of $\mu_\gamma$ implies
that each sample path of the series is generated either by $\mu$ or by $\delta_{\mathbf{y}}$, and this
hardly reflects the nature of any real contaminated series arising in practice.


4.2. The time-series influence functional and its properties. 


DEFINITION 4.2 (Time-series influence functional). Suppose the estimator
sequence $\{T_n\}$ is specified asymptotically by a functional $T = T(\mu)$ defined on a
subset $P_0$ of $P_{se}$, and suppose that $\mu_y^\gamma$ is given by (2.1)-(2.2). Then the influence
functional IF of T is defined as

(4.5)    $\mathrm{IF}(\mu_w, T, \{\mu_y^\gamma\}) = \lim_{\gamma \to 0} \frac{T(\mu_y^\gamma) - T(\mu_y^0)}{\gamma}$

provided the limit exists.


Note that the influence functional depends not only upon the estimator T, the
nominal model $\mu_x$, and the contamination process measure $\mu_w$ as “main”
argument, but also upon the particular trajectory or arc of contamination measures
$\{\mu_y^\gamma\} = \{\mu_y^\gamma : 0 \le \gamma \le 1\}$, as the fraction of contamination $g(\gamma) = \gamma + o(\gamma)$ tends
to zero. By way of contrast the ICH for time series depends only on T, the
nominal measure $\mu$, with $\mu = \mu_x$ in the present context, and the “main”
argument $\delta_{\mathbf{y}}$. Correspondingly, ICH is a directional or Gateaux derivative with
direction specified by $\mu_x$ and $\delta_{\mathbf{y}}$. In order to capture the essential features of
time-series contamination in an influence functional, we take derivatives along
particular arcs $\{\mu_y^\gamma\}$ to $\mu_x$ in $P_{se}$.

It turns out that there is a rather simple connection between IF and ICH
under certain conditions, and it is a connection which facilitates the computation
of IF. The first theorem to follow gives sufficient conditions to ensure that a
trivial expansion argument will yield the key relation (4.6) below.


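THEOREM 4.1. Assume that T is a $\Psi$ estimate with

(a) $\lim_{\gamma \to 0} T(\mu_y^\gamma) = T(\mu_x) = t_0$.
(b) Put

    $m(\gamma, t) = E\Psi(\mathbf{y}_1^\gamma, t).$

There exists an $\varepsilon > 0$ such that

    $D(\gamma, t) = (\partial/\partial t)\, m(\gamma, t)$

exists for $0 \le \gamma < \varepsilon$, $|t - t_0| < \varepsilon$, and $D(\gamma, t)$ is continuous at $(0, t_0)$. Also
$C = D(0, t_0)$ is nonsingular.

(c) $\lim_{\gamma \to 0} m(\gamma, t_0)/\gamma$ exists.

Then

(4.6)    $\mathrm{IF}(\mu_w, T, \{\mu_y^\gamma\}) = \lim_{\gamma \to 0} \frac{E\,\mathrm{ICH}(\mathbf{y}_1^\gamma)}{\gamma}.$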


COMMENT 4.1. For the case where $\Psi$ depends on finitely many arguments the
above expression coincides with (1.18) of Künsch (1984), which in that case is
valid for more general contaminations than those considered here.


Suppose that a $\Psi$ estimate is selected with a view toward use in the i.i.d.
setting. For example, $\Psi$ might define an ordinary M estimate of location or scale.
Then the ith summand in (3.1) depends only on the ith observation and $T_n$ is
permutation invariant by virtue of the equality

    $\Psi_i(\mathbf{y}_1^i, t) = \Psi(y_i, t), \quad i = 1, 2, \ldots, n.$


The following corollary shows that IF is a strict generalization of ICH in the
sense that the two coincide for such estimates when $y_t^\gamma$ is an i.i.d. pure
replacement process.


COROLLARY 4.1. Suppose that the $\Psi$ estimate satisfies

    $\Psi_i(\mathbf{y}_1^i, t) = \Psi(y_i, t), \quad \forall t, \ i = 1, 2, \ldots, n,$

and that $y_t^\gamma$ is given by (2.2) with $z_t^\gamma$ i.i.d., $g(\gamma) = \gamma$, and $w_t \equiv \zeta$. Then

(4.7)    $\mathrm{IF}(\mu_w, T, \{\mu_y^\gamma\}) = \mathrm{IF}(\zeta, T, \{\mu_y^\gamma\}) = \mathrm{ICH}(\zeta).$
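The corollary can be checked numerically in a simple case. The sketch below is a hedged illustration, not taken from the paper: it assumes a Huber location M estimate with psi(y - t), a standard normal core distribution, and a hypothetical tuning constant a = 1.345. Under i.i.d. replacement by a constant zeta, the marginal distribution of the contaminated process is the mixture of N(0,1) with weight 1 - gamma and a point mass at zeta with weight gamma, so the functional can be computed by quadrature and differentiated in gamma by a finite difference. For this estimate the Hampel curve is psi(zeta) / E psi'(X), with E psi'(X) = P(|X| <= a).

```python
import math

A_TUNE = 1.345  # hypothetical Huber tuning constant

def huber_psi(u, a=A_TUNE):
    """Huber psi function: identity on [-a, a], clipped outside."""
    return max(-a, min(a, u))

def expected_psi_normal(t, a=A_TUNE, half_width=8.0, steps=4000):
    """E psi(X - t) for X ~ N(0, 1), by the trapezoid rule on [-8, 8]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * huber_psi(x - t, a) * math.exp(-0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)

def functional(gamma, zeta, a=A_TUNE):
    """Root t of (1 - gamma) E psi(X - t) + gamma psi(zeta - t) = 0."""
    def m(t):
        return ((1.0 - gamma) * expected_psi_normal(t, a)
                + gamma * huber_psi(zeta - t, a))
    lo, hi = -1.0, 1.0                 # m is decreasing in t on this bracket
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if m(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma, zeta = 1e-3, 3.0
if_numeric = (functional(gamma, zeta) - functional(0.0, zeta)) / gamma
ich = huber_psi(zeta) / math.erf(A_TUNE / math.sqrt(2.0))
print(if_numeric, ich)
```

The finite-difference quotient should agree with the Hampel value up to an error of order gamma.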


Assumption (b) in Theorem 4.1 is quite restrictive. The next theorem gives a
useful set of conditions to ensure that the relation (4.6) between IF and ICH
holds for many types of estimates of interest, including GM and RA estimates,
when the process $\{z_t^\gamma\}$ is independent of $\{x_t, w_t\}$ and has distribution $\mu_z^{k,\gamma}$
corresponding to patches of length k generated according to (2.4) of Section 2.2.
The theorem also shows how to compute IF in these cases.


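THEOREM 4.2. Let T be a $\Psi$ estimate with $t_0 = T(\mu_x)$ and suppose that
$\mu_{xwz}^\gamma = \mu_{xw}\mu_z^{k,\gamma}$. Assume ICH is given by (4.2)-(4.2') and that:

(a) $T(\mu_y^\gamma) - t_0 = O(\gamma)$.
(b) $E\Psi(\mathbf{x}_1, t)$ is differentiable at $t = t_0$ and the derivative matrix C given by
(4.2') is nonsingular.
(c) For $m \ge 1$ put

    $H_m(t) = \sup |E\Psi(\mathbf{y}_{2-m}^{1}, \mathbf{y}_{1-m}, t) - E\Psi(\mathbf{y}_{2-m}^{1}, \mathbf{y}_{1-m}^*, t)|,$

where the supremum is with respect to every $\mathbf{y}_{2-m}^{1} = (y_1, y_0, \ldots, y_{2-m})$,
$\mathbf{y}_{1-m} = (y_{1-m}, y_{-m}, \ldots)$, $\mathbf{y}_{1-m}^* = (y_{1-m}^*, y_{-m}^*, \ldots)$ such that each $y_{1-j}$ and $y_{1-j}^*$ may be
either $x_{1-j}$ or $w_{1-j}$, and put

    $H_m^*(\varepsilon) = \sup_{|t - t_0| \le \varepsilon} H_m(t).$

There exists $\varepsilon_0 > 0$ such that

    $\sum_{m=1}^{\infty} H_m^*(\varepsilon_0) < \infty.$

(d) Put

    $H_0(t) = \sup |E\Psi(\mathbf{y}_1, t)|,$

where the supremum is with respect to every $\mathbf{y}_1 = (y_1, y_0, \ldots)$ such that $y_{1-j}$ may
be either $x_{1-j}$ or $w_{1-j}$, and put

    $H_0^*(\varepsilon) = \sup_{|t - t_0| \le \varepsilon} H_0(t).$

There exists $\varepsilon_0 > 0$ such that $H_0^*(\varepsilon_0) < \infty$.
(e) For any $\mathbf{y}_1 = (y_1, y_0, \ldots)$, where each $y_{1-j}$ is either $x_{1-j}$ or $w_{1-j}$, we have

    $\lim_{t \to t_0} \Psi(\mathbf{y}_1, t) = \Psi(\mathbf{y}_1, t_0)$ a.s.

and there exists $\varepsilon > 0$ such that

    $E \sup_{|t - t_0| \le \varepsilon} |\Psi(\mathbf{y}_1, t)| < \infty.$

Then (4.6) holds, and

(4.8)    $\mathrm{IF}(\mu_w, T, \{\mu_y^\gamma\}) = -\frac{1}{k}\, C^{-1} \sum_{j=0}^{\infty} G_j,$

where

(4.8')   $G_j = E\Psi(\mathbf{w}_{1-j}^{1}, \mathbf{x}_{-j}, t_0)$ if $0 \le j \le k-1$,
         $G_j = E\Psi(\mathbf{x}_{k-j+1}^{1}, \mathbf{w}_{1-j}^{k-j}, \mathbf{x}_{-j}, t_0)$ if $j \ge k$.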


COMMENT 4.2. We will see in Section 5 that the assumptions of this theorem
are satisfied under general regularity conditions on the $\eta$ function for GM and
RA estimates for the AR(1) and MA(1) models (Theorems 5.2 and 5.4,
respectively).


As in the case of the Hampel influence curve, boundedness of IF is of interest
in connection with robustness. The following theorem gives sufficient conditions
for boundedness of IF. We introduce some notation. Given an n × m matrix A,
let $\|A\|$ be defined as $\sup\{|Au| : |u| = 1,\ u \in R^m\}$. Let $N_0$ be the set of
nonnegative integers. Given a subset $I \subset N_0$, let $\mathbf{y}_{t,I} = \{y_{t-k} : k \in I\}$.




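THEOREM 4.3. Let T be a $\Psi$ estimate such that (4.6) holds. Assume also

(a)    $\Psi(\mathbf{y}_1, t) = \sum_{j=1}^{\infty} \eta_j(\mathbf{y}_{1,I_j}, t),$

where the $I_j$ are subsets of $N_0$ such that the number of elements in each $I_j$ is
uniformly bounded by a finite integer h, and

    $\sup_{\mathbf{y}_{1,I_j}} |\eta_j(\mathbf{y}_{1,I_j}, t_0)| \le K_j,$

with

    $\sum_{j=1}^{\infty} K_j = K < \infty.$

(b) $E\eta_j(\mathbf{x}_{1,I_j}, t_0) = 0, \ \forall j$.

(c) $E\Psi(\mathbf{x}_1, t)$ is differentiable at $t = t_0$ and the derivative matrix C is
nonsingular.
Then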


(4.9)    $|\mathrm{IF}(\mu_w, T, \{\mu_y^\gamma\})| \le 2K\|C^{-1}\|.$


COMMENT 4.3. The above theorem applies to GM and RA estimates of
autoregressive models if $\eta$ is bounded; see (3.7) and (3.13). However,
boundedness of $\Psi$ is not in general sufficient for the boundedness of IF. In Section 5.2 we
give an MA(1) model example where estimates with bounded $\Psi$ have unbounded
IF, the reason being that the $\eta_j$ depend on an infinite number of coordinates for
moving-average models.


5. Influence functionals for AR(1) and MA(1) models with additive
outliers. The computation of time-series influence functionals will be carried
out for both GM estimates, denoted $T^{GM}$ (cf. Examples 1 and 3) and RA
estimates, denoted $T^{RA}$ (cf. Examples 2 and 4). Throughout this section and the
remainder of the paper the only outlier model we deal with is the AO model as
described following (2.3), and with Gaussian AR(1) and MA(1) $x_t$ processes (3.4)
and (3.5). We selected the AO model for our computations primarily because it
has been used in previous studies (e.g., Denby and Martin, 1979; Bustos and
Yohai, 1986). We hope later on to make computations for pure replacement
models, higher-order AR and MA models, etc.

Corresponding to the special cases treated, it is convenient to replace the
notation $\mathrm{IF} = \mathrm{IF}(\mu_w, T, \{\mu_y^\gamma\})$ by $\mathrm{IF}_{AO,k} = \mathrm{IF}_{AO,k}(\mu_v, T, \lambda)$ where $T = T^{GM}$ or
$T = T^{RA}$ and $\lambda = \phi$ or $\theta$. Here, the subscript k indicates the patch length for
patchy outliers, and with $\mu_x$ fixed, specification of $\mu_v$ is equivalent to
specification of $\mu_w$ for the AO model. When k = 1 we have independent outliers. We also
replace $\mathrm{ICH}(\mathbf{y}_1, T, \mu_x)$ by $\mathrm{ICH}(\mathbf{y}_1, T, \lambda)$.


5.1. AR(1) models. We first state results concerning the ICH, and asymptotic
variance (in central limit theorem form) of $T^{GM}$ and $T^{RA}$ at the Gaussian
model. The need for the asymptotic variance under the nominal Gaussian model
arises when comparing the IF’s of different estimates: their tuning constants will
be adjusted to obtain matched asymptotic efficiencies at the nominal model. The
following assumptions will be used to prove these results:


(A1) $\eta(\cdot, \cdot)$ is continuous and odd in each variable.
(A2) $|\eta(u_1, u_2)| \le K|u_1|^{k_1}|u_2|^{k_2}$, where $k_1$ and $k_2$ are either 0 or 1.
(A3) $\eta_i(u_1, u_2) = \partial\eta(u_1, u_2)/\partial u_i$, $i = 1, 2$, are continuous and

    $|\eta_1(u_1, u_2)| \le K|u_2|^{h_1}, \qquad |\eta_2(u_1, u_2)| \le K|u_1|^{h_2},$

where $h_1$ and $h_2$ are either 0 or 1.
(A4)
(5.1)    $B = E\{v \cdot (\partial/\partial x)\eta(x, v)|_{x=u}\} \ne 0,$

where u and v are independent N(0, 1) random variables.


Observe that the LS estimates satisfy (A2) with $k_1 = k_2 = 1$ and (A3) with
$h_1 = h_2 = 1$. M estimates with bounded $\psi$ and $\psi'$ satisfy (A2) with $k_1 = 0$,
$k_2 = 1$ and (A3) with $h_1 = 1$ and $h_2 = 0$. GM and RA estimates with bounded $\eta$
satisfy (A2) with $k_1 = k_2 = 0$, and if they are of the Mallows type with $\psi'$
bounded, they satisfy (A3) with $h_1 = h_2 = 0$, while if they are of the
Hampel-Krasker type with $\psi'$ bounded, (A3) is satisfied with $h_1 = h_2 = 1$.





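THEOREM 5.1.
(i) Under (A1), (A3), and (A4) we have

(5.2)    $\mathrm{ICH}(\mathbf{y}_1, T^{GM}, \phi) = \frac{(1 - \phi^2)^{1/2}}{B}\, \eta\big(y_1 - \phi y_0,\ y_0(1 - \phi^2)^{1/2}\big),$

(5.3)    $\mathrm{ICH}(\mathbf{y}_1, T^{RA}, \phi) = \frac{1 - \phi^2}{B} \sum_{j=1}^{\infty} \phi^{j-1} \eta\big(r_1(\phi), r_{1-j}(\phi)\big),$

where $r_t(\phi) = y_t - \phi y_{t-1}$.
(ii) Let $T_n$ denote either $T_n^{GM}$ or $T_n^{RA}$, and set

(5.4)    $A = E\eta^2(u, v)$

with u, v independent N(0,1) random variables. If (A1)-(A4) and the Gaussian
AR(1) model (3.4) hold, then

    $n^{1/2}(T_n - \phi) \to_d N(0, V)$

with

(5.5)    $V = (1 - \phi^2)\, A / B^2.$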




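THEOREM 5.2. Suppose that the AO model holds, i.e., $w_t = x_t + v_t$ in (2.2),
with $v_t$ independent of $x_t$. Assume (A1)-(A4) and that $E|v_t|^{k_1 + k_2} < \infty$, where $k_1$
and $k_2$ are as in (A2).

(i) For independent outliers we have

(5.6)    $\mathrm{IF}_{AO,1}(\mu_v, T^{GM}, \phi) = \frac{(1 - \phi^2)^{1/2}}{B}\, E\eta\big(u_1 - \phi v_0,\ (x_0 + v_0)(1 - \phi^2)^{1/2}\big)$

and

(5.7)    $\mathrm{IF}_{AO,1}(\mu_v, T^{RA}, \phi) = \frac{1 - \phi^2}{B}\, E\eta(u_1 - \phi v_0,\ u_0 + v_0).$

(ii) For patches of outliers of length $k \ge 2$ we have

(5.8)    $\mathrm{IF}_{AO,k}(\mu_v, T^{GM}, \phi) = \frac{(1 - \phi^2)^{1/2}}{kB}
             \Big[(k - 1)\, E\eta\big(u_1 + v_1 - \phi v_0,\ (x_0 + v_0)(1 - \phi^2)^{1/2}\big)
             + E\eta\big(u_1 - \phi v_0,\ (x_0 + v_0)(1 - \phi^2)^{1/2}\big)\Big]$

and

(5.9)    $\mathrm{IF}_{AO,k}(\mu_v, T^{RA}, \phi) = \frac{1 - \phi^2}{kB}
             \Big[\sum_{h=1}^{k-2} (k - h - 1)\phi^{h-1} E\eta(u_1 + v_1 - \phi v_0,\ u_{1-h} + v_{1-h} - \phi v_{-h})
             + \sum_{h=1}^{k-1} \phi^{h-1} E\eta(u_1 + v_1 - \phi v_0,\ u_{1-h} + v_{1-h})
             + \sum_{h=1}^{k-1} \phi^{h-1} E\eta(u_1 - \phi v_0,\ u_{1-h} + v_{1-h} - \phi v_{-h})
             + \phi^{k-1} E\eta(u_1 - \phi v_0,\ u_{1-k} + v_{1-k})\Big].$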


COMMENT 5.1. The expectations in (5.6)-(5.9) are with respect to the
measure $\mu_{xv} = \mu_x \mu_v$, where $\mu_x$ yields all the necessary joint distributions for the $x_t$
and $u_t$ in the AR(1) model. Here the measure $\mu_v \in P_{se}$ is quite general. We
specialize to the leading case of degenerate measures $\delta_\zeta$ corresponding to $v_t \equiv \zeta$,
when computing IF’s in Section 5.3.
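For the degenerate case of a constant outlier value zeta, the independent-outlier expressions of Theorem 5.2 reduce to two-dimensional Gaussian expectations that can be approximated on a grid. The sketch below is illustrative and rests on assumptions spelled out in the comments: it takes the GM expression as sqrt(1 - phi^2)/B times E eta(u - phi zeta, v + zeta sqrt(1 - phi^2)) and the RA expression as (1 - phi^2)/B times E eta(u - phi zeta, v + zeta), with u and v independent N(0,1) and B as in (5.1). The grid rule, the finite-difference step, and all function names are choices made here, not the authors' method.

```python
import math

def normal_grid(m=121, half_width=6.0):
    """Nodes and normalized weights of a simple Riemann rule for N(0, 1)."""
    h = 2.0 * half_width / (m - 1)
    nodes = [-half_width + i * h for i in range(m)]
    weights = [math.exp(-0.5 * x * x) for x in nodes]
    total = sum(weights)
    return nodes, [w / total for w in weights]

NODES, WEIGHTS = normal_grid()

def expect2(f):
    """Approximate E f(U, V) for independent standard normal U and V."""
    return sum(wu * wv * f(u, v)
               for u, wu in zip(NODES, WEIGHTS)
               for v, wv in zip(NODES, WEIGHTS))

def slope_b(eta, d=1e-4):
    """B of (5.1), E[v * d/dx eta(x, v) at x = u], by a central difference."""
    return expect2(lambda u, v: v * (eta(u + d, v) - eta(u - d, v)) / (2.0 * d))

def ic_gm(eta, phi, zeta):
    """GM influence curve, independent outliers, v_t identically zeta."""
    c = math.sqrt(1.0 - phi * phi)
    return c / slope_b(eta) * expect2(
        lambda u, v: eta(u - phi * zeta, v + zeta * c))

def ic_ra(eta, phi, zeta):
    """RA influence curve, independent outliers, v_t identically zeta."""
    return (1.0 - phi * phi) / slope_b(eta) * expect2(
        lambda u, v: eta(u - phi * zeta, v + zeta))

def eta_ls(u1, u2):
    """Least squares: eta(u1, u2) = u1 * u2."""
    return u1 * u2

def eta_hkw_huber(u1, u2, a=2.62):
    """HKW type with Huber psi; a = 2.62 is the Table 5.1 constant."""
    return max(-a, min(a, u1 * u2))

phi, zeta = 0.5, 2.0
print(ic_gm(eta_ls, phi, zeta), ic_ra(eta_ls, phi, zeta))
print(ic_ra(eta_hkw_huber, phi, 50.0))
```

For least squares both expressions reduce to the closed form -(1 - phi^2) phi zeta^2, which is quadratically unbounded in zeta, whereas a bounded eta keeps the AR(1) curve bounded by (1 - phi^2) sup|eta| / B.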


5.2. MA(1) models. The following theorem gives the ICH’s and asymptotic
variances of $T^{GM}$ and $T^{RA}$ for the MA(1) Gaussian model, as given in Examples
3 and 4, respectively, of Section 3. The scalar-valued limit $\psi$ functions for $T^{GM}$
and $T^{RA}$ are given by (3.18) and (3.22), respectively.




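THEOREM 5.3. (i) Under (A1), (A3), and (A4) we have

(5.10)    $\mathrm{ICH}(\mathbf{y}_1, T^{GM}, \theta) = -\frac{(1 - \theta^2)^{1/2}}{B}\, \eta\big(r_1(\theta),\ s_0(\theta)(1 - \theta^2)^{1/2}\big),$

(5.11)    $\mathrm{ICH}(\mathbf{y}_1, T^{RA}, \theta) = -\frac{1 - \theta^2}{B} \sum_{j=1}^{\infty} \theta^{j-1} \eta\big(r_1(\theta), r_{1-j}(\theta)\big),$

where B is given in (A4), and $r_t(\theta)$ and $s_t(\theta)$ are given by (3.19).
(ii) If $T_n$ represents the GM or RA estimate, and (A1)-(A4) and the Gaussian
MA(1) model (3.5) hold, then

(5.12)    $n^{1/2}(T_n - \theta) \to_d N(0, (1 - \theta^2)A/B^2),$

where A is given by (5.4), and B is given in (A4).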


The following theorem gives the IF’s of $T^{GM}$ and $T^{RA}$ for patches of length
one (i.e., independent outliers).


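THEOREM 5.4. Assume (A1)-(A4), and that the AO model holds, i.e., $w_t =
x_t + v_t$ with $v_t \sim \mu_v$ and independent of $x_t$. Further suppose that the process
$\{z_t^\gamma\}$ is independent of the processes $\{x_t, v_t\}$, with $z_t^\gamma$ an i.i.d. Bernoulli
sequence. Assume also that $E|v_t|^{h+1} < \infty$, where $h = \max(h_1, h_2)$ with $h_1$, $h_2$ as
in (A3). Then

(5.13)    $\mathrm{IF}_{AO,1}(\mu_v, T^{GM}, \theta) = -\frac{(1 - \theta^2)^{1/2}}{B}
              \sum_{j=1}^{\infty} E\eta\big(u_1 + \theta^j v_0,\ u_2 + j(1 - \theta^2)^{1/2}\theta^{j-1} v_0\big)$

and

(5.14)    $\mathrm{IF}_{AO,1}(\mu_v, T^{RA}, \theta) = -\frac{1 - \theta^2}{B}
              \sum_{j=1}^{\infty} \sum_{l=1}^{j} \theta^{l-1} E\eta\big(u_1 + \theta^j v_0,\ u_2 + \theta^{j-l} v_0\big),$

where $u_1$ and $u_2$ are independent N(0,1) random variables, with $u_1$, $u_2$
independent of $v_0$.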


COMMENT 5.2. From formulas (5.13) and (5.14) it is easy to see that for the
MA(1) model the influence functional of GM and RA estimates is unbounded
when $\eta$ is monotone but bounded. Just take the supremum of the above influence
curves over $\zeta$, with $\mu_v = \delta_\zeta$. Thus boundedness of the $\psi$ function in (3.18) and
(3.22) does not ensure boundedness of the IF for MA(1) models. This is a general
feature of models with moving-average components.




However, it is possible to show that when $\eta$ is of the Mallows type (3.8) or
HKW type (3.9) based on redescending $\psi$, e.g., $\psi_{BS}$ given by (5.15), the
corresponding RA and GM estimates have bounded IF. These results are illustrated in
calculations to follow.


5.3. Some influence curve computations. Although general measures $\mu_v$ are
used in the IF expressions of the preceding subsection, a leading case for
expressing the intuitive notion of the “influence” of a configuration of
contamination points is obtained by using a degenerate measure for $\mu_v$ (cf. Sections 2.1
and 2.3 on configurations). Thus for both i.i.d. and patch “configurations” we
shall let $P(v_t = \zeta) = 1$, for a constant $\zeta$ (among other possibilities one might let
$P(v_t = \zeta) = P(v_t = -\zeta) = 1/2$). This allows one to step down from the abstract
view of the IF as a functional on measure space. One can now compute and plot
IF’s as a function of the contamination value $\zeta$, thus retaining the rich heuristic
attraction of the ICH. Correspondingly, we use the term influence curve, IC =
IC($\zeta$), to describe this special case of an influence functional.

We calculated IC($\zeta$) for least-squares (LS), GM, and RA estimates of the
HKW type at the following AO models: (i) AR(1) with both independent and
patch outliers, and (ii) MA(1) with only independent outliers. Two psi functions
are used for each of the choices of $\eta$, namely the Tukey redescending bisquare
function

(5.15)    $\psi_{BS,a}(u) = u\big(1 - (u/a)^2\big)^2$ if $|u| \le a$, and $\psi_{BS,a}(u) = 0$ if $|u| > a$,

and the Huber function

(5.16)    $\psi_{H,a}(u) = \min(a, \max(u, -a)).$

The tuning constants a were adjusted for each estimate to obtain 95% efficiency
at a perfectly observed Gaussian AR(1) process. The values of the constants are
given in Table 5.1.
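The two psi functions are simple to code. The sketch below assumes the standard forms of the Tukey bisquare, u (1 - (u/a)^2)^2 on [-a, a] and zero outside, and of the Huber function, min(a, max(u, -a)); the function names are hypothetical and the constants passed in the usage lines are the HKW values listed in Table 5.1.

```python
def psi_bisquare(u, a):
    """Tukey bisquare: u * (1 - (u/a)^2)^2 on [-a, a], zero outside."""
    if abs(u) > a:
        return 0.0
    return u * (1.0 - (u / a) ** 2) ** 2

def psi_huber(u, a):
    """Huber function: min(a, max(u, -a))."""
    return min(a, max(u, -a))

# Redescending versus monotone behavior for a large argument.
print(psi_bisquare(20.0, 9.36), psi_huber(20.0, 2.62))
```

The bisquare returns to zero for large arguments (redescending), while the Huber function stays at its bound; this is the distinction that matters for the MA(1) results discussed in Comment 5.2.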

The results of these IC calculations are shown in Figures 1-3. Figure 1 shows
the AR(1) results, with $\phi = 0.5$, for independent outliers and patches of length 20.
Figure 2 is the same except that $\phi = 0.9$. The least-squares influence curve is
quadratically unbounded in both cases. The general messages for the robust
estimates are clear: (i) the bisquare psi function is preferred to the Huber psi
function; (ii) the RA estimates are preferred over the GM estimates for
independent outliers, while the reverse is true for long patches.


TABLE 5.1
Tuning constants

                 HKW estimate    Mallows estimate
$\psi_H$             2.62             1.65
$\psi_{BS}$          9.36             5.58


FIG. 1. Influence curves for the AR(1) model. Hampel-Krasker-Welsch estimates at $\phi = 0.5$.
(Plot not reproduced; recovered panel axis label: Magnitude of Outliers in Patches of Length 20.)


Our preference orderings here are based on the gross-error sensitivities (GES’s) 
of the IC’s, the GES’s for these particular examples being simply the supremum 
of |IC(¢)|. The GES’s here have property (P3), just as in the case of the GES for 
Hampel’s influence curve ICH. 

Figure 3 shows IC’s for the AO MA(1) model with $\theta = -0.5$ and $\theta = -0.9$
(with our sign convention this gives positive correlation for the $x_t$ process at
lag-one). The results are in keeping with Comment 5.2: GM and RA estimates
based on the monotone Huber psi functions have unbounded IC’s (though
apparently not quadratically unbounded as in the case of LS), while the bisquare
$\psi$ function leads to bounded IC’s for the MA(1) model. Also, the GM estimate
seems to be preferable to the RA estimate.


FIG. 2. Influence curves for the AR(1) model. Hampel-Krasker-Welsch estimates at $\phi = 0.9$.
(Plot not reproduced; recovered panel axis label: Magnitude of Outliers in Patches of Length 20; recovered curve label: BS-RA.)


This last observation surprised us, as we had been somewhat pessimistic about 
using GM estimates for MA and ARMA models since such estimates seemed 
particularly natural only for AR models. We are now motivated to take the 
possibility of using GM estimates for ARMA models more seriously, and under- 
take a careful study. 

We have carried out a parallel set of IC calculations based on GM and RA 
estimates of the Mallows type. The Mallows type IC’s are displayed in Figures 
4-6 of Martin and Yohai (1984a), and differ from the HKW type IC’s presented
here only by virtue of having slightly different shapes and slightly larger GES’s. 


[Figure omitted; surviving labels: panels (a) and (b); horizontal axis "Magnitude of Independent Outliers"]

FIG. 3. Influence curves for the MA(1) model. Hampel-Krasker-Welsch estimates (a) at $\theta = -0.5$
and (b) at $\theta = -0.9$.


COMMENT 5.3. Some proposed robust ARMA model parameter estimates are
not tractable with regard to obtaining closed-form expressions for their influence
functionals or influence curves. This is the case, for example, with the AM
estimates based on robust filter cleaners (see, for example, Kleiner, Martin, and
Thomson, 1979; Martin, 1981; Martin and Yohai, 1985). However, one can
estimate influence functionals for such estimates via simulation. Some work has
been completed along these lines, and will be reported elsewhere.


6. Gross-error sensitivity. In this section we give a general definition of 
gross-error sensitivity (GES) based on the IF. Specific results are then given for
the GES’s of GM and RA estimates of the first-order autoregression parameter. 




6.1. The gross-error sensitivity. Suppose that the contamination process $y_i^\gamma$
is given by (2.2) and its related assumptions. Then for a fixed set of measures
$\{\mu_{xwz}^\gamma\} = \{\mu_{xwz}^\gamma : 0 \le \gamma < 1\}$, we have a particular arc
$\{\mu_y^\gamma\} = \{\mu_y^\gamma : 0 \le \gamma < 1\}$ of
contaminated process measures in $P_{se}$. Suppose that the IF (4.5) exists for a given
family P of arcs $\{\mu_y^\gamma\}$ with each arc in $P_{se}$. This family P will be generated by
letting the contamination process measure $\mu_{xw}$ vary over a family $P_{xw}$ in
a manner consistent with the dependency structure $\{\mu_{xwz}^\gamma\}$, while
$\{\mu_z^\gamma\} = \{\mu_z^\gamma : 0 \le \gamma < 1\}$ and $\mu_x$ are fixed. Often $\{\mu_z^\gamma\}$ will be independent of $\mu_{xw}$, and then
P will be generated by letting $\mu_w$ vary over $P_w$ in a manner consistent with the
dependence structure of $\mu_{xw}$. For the pure replacement model given in Section
2.2, $\mu_{xw} = \mu_x \mu_w$, so P is then determined by simply letting $\mu_w$ range over $P_w$. In
the AO model where $w_i = x_i + v_i$ with $x_i$ and $v_i$ independent, the measures
$\mu_{xw} = \mu_{x,x+v}$ are specified by $\mu_x$ and $\mu_v$, and P is generated by letting $\mu_v$ range
over a prescribed family $P_v$.


DEFINITION 6.1. The gross-error sensitivity (GES) of an estimate T at the
family P of arcs $\{\mu_y^\gamma\}$ is

$$\mathrm{GES}(P, T) = \sup_{\{\mu_y^\gamma\} \in P} \bigl|\mathrm{IF}(\mu_w, T, \{\mu_y^\gamma\})\bigr|. \tag{6.1}$$


COMMENT 6.1. Since ICH has leading argument y by virtue of being a
directional derivative determined by the point mass contamination measure $\delta_y$,
Hampel’s (1974) GESH is the supremum over all y of |ICH(y)|. Our IF depends
on the arc $\{\mu_y^\gamma\}$, with each $\mu_w \in P_w$ specifying a particular $\{\mu_y^\gamma\} \in P$, and so our
GES involves the supremum over arcs $\{\mu_y^\gamma\}$ in P.


COMMENT 6.2. In either a pure replacement model or an AO model where the
family of arcs P is generated by $P_w$ or $P_v$, respectively, a leading case of GES is
obtained when $w_i \equiv v_i \equiv \xi$, and correspondingly $P_w = P_v = \{\delta_\xi\}$, where $\delta_\xi$ is the
point mass on $R^\infty$. In this case we replace IF by IC (for influence curve), replace
the argument $\mu_w$ by $\xi$, take the supremum over all $\xi$, and replace the GES
argument P by $\{\mu_y^\gamma\}$.


6.2. GES computations. In Table 6.1 below we give GES’s corresponding to 
the Mallows estimate IC’s computed for AR(1) models in Section 5.3. Specific 
formulas which allow one to compute these GES’s are derived in Martin and 
Yohai (1984b). The estimates are matched by the choice of tuning constants a to 
have the same asymptotic efficiencies at the Gaussian model. This means that 
$\sup_r |\psi_{BS,a}(r)| > \sup_r |\psi_{H,a}(r)|$ and correspondingly
$\sup_{u,v} |\eta_{BS,a}(u, v)| > \sup_{u,v} |\eta_{H,a}(u, v)|$. Yet, as Table 6.1 shows, the GES’s for the bisquare psi
function are often smaller than those for the Huber psi function.

The GES’s for the HKW estimates are smaller than those of the Mallows 
estimates, but the differences are only slight. 
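To make the two psi-function families concrete, the following sketch evaluates a monotone Huber psi and a redescending bisquare psi and approximates their suprema on a grid. It is a minimal illustration, not the authors' code: the tuning constants `a_h` and `a_bs` are arbitrary and are not efficiency-matched, the bisquare is written in the common form $r(1 - (r/a)^2)^2$ on $|r| \le a$ (an assumed parametrization), and `sup_abs` is a hypothetical helper.

```python
import math

def psi_huber(r, a):
    """Monotone Huber psi: identity on [-a, a], clipped to +/- a outside."""
    return max(-a, min(a, r))

def psi_bisquare(r, a):
    """Redescending bisquare psi: r * (1 - (r/a)^2)^2 on |r| <= a, zero outside."""
    if abs(r) > a:
        return 0.0
    return r * (1.0 - (r / a) ** 2) ** 2

def sup_abs(f, lo, hi, steps=20001):
    """Grid approximation to sup |f(r)| over [lo, hi]."""
    grid = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return max(abs(f(r)) for r in grid)

a_h, a_bs = 1.5, 6.0  # illustrative constants only (assumption)
sup_h = sup_abs(lambda r: psi_huber(r, a_h), -20.0, 20.0)
sup_bs = sup_abs(lambda r: psi_bisquare(r, a_bs), -20.0, 20.0)

# For this bisquare form the maximum is attained at r = a / sqrt(5).
analytic_sup_bs = a_bs * 16.0 / (25.0 * math.sqrt(5.0))
print(sup_h, sup_bs, analytic_sup_bs)
```

The point of the comparison in the text is that a larger supremum of the psi function does not by itself imply a larger GES; the redescending shape matters once the outlier also enters through lagged terms.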




TABLE 6.1
AR(1) GES’s for additive outliers

                 $\phi = 0.5$             $\phi = 0.9$
Estimator     k = 1     k = 20      k = 1     k = 20
LS            ∞         ∞           ∞         ∞
GM-H          -2.8      2.6         -1.4      1.3
RA-H          -2.5      4.3         -0.6      3.3
GM-BS         -1.9      1.6         -1.8      0.4
RA-BS         -1.5      3.1         -1.5      2.5


TABLE 6.2
MA(1) GES’s for additive outliers

Estimator     $\theta = -0.5$     $\theta = -0.9$
LS            ∞                   ∞
GM-H          ∞                   ∞
RA-H          ∞                   ∞
GM-BS         4.0                 10.0
RA-BS         3.8                 38.0





COMMENT 6.3. In Martin and Yohai (1984b) it is shown that for indepen- 
dently located additive outliers, RA estimates have smaller GES’s than GM 
estimates. On the other hand, for long patches GM estimates are better. These
properties are reflected in Table 6.1. 

Table 6.2 gives GES’s corresponding to the IC’s computed for MA(1) models 
with k = 1 (independent outliers) in Section 5.3. The striking feature here was 
already evident in the IC calculations: For the MA(1) model, a redescending psi 
function is needed to obtain a bounded IC. Although the GM-BS estimate is a
bit worse than the RA-BS estimate at $\theta = -0.5$, it is much better at $\theta = -0.9$,
where the $x_i$ process is correspondingly more highly correlated.


7. An optimality property of generalized RA estimates. As one application
of the IF in constructing good estimates, we define here a class of generalized
RA estimates (GRA estimates) for the AR(1) model (3.4) and show that a
member of this class has a certain optimality property.
A GRA estimate for the AR(1) model is a $\Psi$ estimate with $\psi_i$ of the form

$$\psi_i^{GRA}(\mathbf{y}_i, \phi) = \sum_{j=1}^{i-2} \eta_j\bigl(r_i(\phi), r_{i-j}(\phi), \phi\bigr), \qquad i \ge 3,$$

where $r_i(\phi) = y_i - \phi y_{i-1}$ and the limit $\psi$ function is assumed to be

$$\psi^{GRA}(\mathbf{y}_1, \phi) = \sum_{j=1}^{\infty} \eta_j\bigl(r_1(\phi), r_{1-j}(\phi), \phi\bigr). \tag{7.1}$$




The AR(1) RA estimate of Example 3 in Section 2 has $\psi$ function (3.13), which is
a special case of the above GRA $\psi$ function.

We will show that for a large subclass of general $\Psi$ estimators there exist GRA
estimates with smaller asymptotic variance, and the same IF at AO models, when
the outliers are independent and $x_i$ is a Gaussian AR(1) process.

Consider a general $\Psi$ estimate with limit $\psi$ function expressed in the form

$$\psi(\mathbf{y}_1, \phi) = \psi^{+}\bigl(r_1(\phi), r_0(\phi), \ldots, \phi\bigr). \tag{7.2}$$

We will define our “optimal” GRA estimate $\psi^*$ by (7.1) with the particular $\eta_j$
functions

$$\eta_j^*(u_1, u_{1-j}, \phi) = E\bigl[\psi^{+}(u_1, u_0, u_{-1}, \ldots, \phi) \mid u_1, u_{1-j}\bigr], \tag{7.3}$$

where the $u_i$ are the autoregression innovations in (3.4). Call T the estimate
based on $\psi$, and T* the GRA estimate based on $\psi^*$.
We will use the following assumptions:


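(C1) $\psi^{+}$ is differentiable in each variable, and for all $j \ge 1$, $k = 1, 1-j$,

$$(\partial/\partial u_k)\,\eta_j^*(u_1, u_{1-j}, \phi) = E\bigl[(\partial/\partial u_k)\,\psi^{+}(u_1, u_0, u_{-1}, \ldots, \phi) \mid u_1, u_{1-j}\bigr],$$

and

$$(\partial/\partial t)\,\eta_j^*(u_1, u_{1-j}, t)\big|_{t=\phi} = E\bigl[(\partial/\partial t)\,\psi^{+}(u_1, u_0, u_{-1}, \ldots, t)\big|_{t=\phi} \mid u_1, u_{1-j}\bigr].$$

(C2) $\psi^{+}(-a_1, a_2, a_3, \ldots, \phi) = -\psi^{+}(a_1, a_2, a_3, \ldots, \phi)$.
(C3) $\psi^{+}(a_1, -a_2, -a_3, \ldots, \phi) = -\psi^{+}(a_1, a_2, a_3, \ldots, \phi)$.
(C4) $\psi(\mathbf{y}_1, \phi)$ is bounded in $\mathbf{y}_1$ for each $\phi \in (-1, 1)$.
(C5) The order of integration and differentiation in (4.2') may be interchanged,
and with $D = (\partial/\partial\phi)$,

$$C_\psi = E\,D\psi(\mathbf{x}_1, \phi) \ne 0.$$

The same is true of $\psi^*$ in place of $\psi$.

(C6) T and T* satisfy (4.6), and $\psi$ satisfies the conditions of Theorem 4.2.
(C7) T and T* are asymptotically normal with variances given by (4.3)-(4.4).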

Notice that the GM estimates of the Mallows and Hampel type with odd $\psi$
functions satisfy (C2) and (C3). (C2) guarantees the Fisher consistency of the
estimate when the $u_i$’s have a symmetric distribution.

The following theorem gives the optimality property of the GRA estimate T*
based on $\psi^*$.


THEOREM 7.1. Assume (C1)-(C7), and suppose that $x_i$ is a Gaussian AR(1)
process. Then

(i) $\mathrm{AVAR}(T^*) \le \mathrm{AVAR}(T)$, where AVAR denotes asymptotic variance when
$y_i \equiv x_i$.

(ii) Suppose that $v_i$ is independent of $x_i$. Then

$$\mathrm{IF}_{AO,1}(\mu_v, T^*, \phi) = \mathrm{IF}_{AO,1}(\mu_v, T, \phi).$$


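EXAMPLE 7.1. For a GM estimate of the Mallows type we have

$$\psi(\mathbf{y}_1, \phi) = \psi(y_1 - \phi y_0)\,\psi\bigl(y_0 (1 - \phi^2)^{1/2}\bigr)$$

and the corresponding GRA estimate is given by

$$\eta_j(r_1, r_{1-j}, \phi) = \psi(r_1)\,A\bigl(d_j(\phi)\, r_{1-j},\; s_j(\phi)\bigr),$$

where

$$d_j(\phi) = \phi^{j-1}(1 - \phi^2)^{1/2}, \qquad s_j(\phi) = \bigl(1 - d_j^2(\phi)\bigr)^{1/2}, \tag{7.4}$$

$$A(a, b) = E\,\psi(a + bu),$$

with $u \sim N(0, 1)$.


EXAMPLE 7.2. For the HKW-type GM estimate

$$\psi(\mathbf{y}_1, \phi) = \psi\bigl((y_1 - \phi y_0)\, y_0 (1 - \phi^2)^{1/2}\bigr)$$

and the corresponding GRA estimate is given by

$$\eta_j(r_1, r_{1-j}, \phi) = A\bigl(d_j(\phi)\, r_1 r_{1-j},\; s_j(\phi)\, r_1\bigr).$$

We note that if $\psi$ is of the Huber type $\psi_{H,c}$ then

$$A(a, b) = b\left[f_N\!\left(\frac{c+a}{b}\right) - f_N\!\left(\frac{c-a}{b}\right)\right] + (c + a)\,F_N\!\left(\frac{c+a}{b}\right) - (c - a)\,F_N\!\left(\frac{c-a}{b}\right) - a,$$

where $f_N$ and $F_N$ are the standard normal density and distribution functions,
respectively.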
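The Huber-type expression for $A(a, b) = E\,\psi_{H,c}(a + bu)$, $u \sim N(0,1)$, can be checked numerically. The sketch below assumes $b > 0$ and uses the closed form $b[f_N((c+a)/b) - f_N((c-a)/b)] + (c+a)F_N((c+a)/b) - (c-a)F_N((c-a)/b) - a$, which follows from integrating the clipped linear function against the normal density. The function names, the test values of `a`, `b`, `c`, and the trapezoid rule settings are illustrative assumptions.

```python
import math

def norm_pdf(x):
    """Standard normal density f_N."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal distribution function F_N, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def psi_huber(x, c):
    """Huber psi with clipping constant c."""
    return max(-c, min(c, x))

def A_closed_form(a, b, c):
    """Closed form for E psi_{H,c}(a + b*u), u ~ N(0,1), assuming b > 0."""
    hi = (c + a) / b
    lo = (c - a) / b
    return (b * (norm_pdf(hi) - norm_pdf(lo))
            + (c + a) * norm_cdf(hi)
            - (c - a) * norm_cdf(lo)
            - a)

def A_quadrature(a, b, c, half_width=10.0, steps=20000):
    """Trapezoid-rule approximation of the same expectation on [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        u = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * psi_huber(a + b * u, c) * norm_pdf(u)
    return total * h

print(A_closed_form(0.7, 1.3, 1.5), A_quadrature(0.7, 1.3, 1.5))
```

Two quick sanity checks are built into the form: at $a = 0$ the value is zero by symmetry, and as $c$ grows the value tends to $a$, the mean of the unclipped argument.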


8. Generalizations and further applications. In defining the influence 
functional for time series, we have for convenience concentrated on: (i) the use of 
a particular type of contamination model, namely the general replacement 
model (2.2), and (ii) application of the influence functional to the study of some
particular estimates of AR(1) and MA(1) parameters. Neither of these
two narrow points of focus adequately reflects the potential utility of the 
time-series influence functional as a tool for studying the effects of many types of 
contamination on statistics used in a wide variety of time-series problems. 

With regard to contamination type, any realistic model of contamination can 
be used for which the derivative (4.5) defining the influence functional exists. It 
would of course be extremely helpful if the IF admits a tractable analytic form, 
but this is not absolutely essential (see comment at end of Section 5). 

As for IF’s for other time-series statistics, the following two examples should 
give some indication of the range of possibilities which remain to be explored. 




8.1. Testing for white noise. Given a zero mean time series $y_1, \ldots, y_n$ with
measure $\mu_y$, one can test for whiteness of the series using the statistic

$$V_n^2 = \sum_{l=1}^{L} \hat\rho_l^2,$$

where

$$\hat\rho_l = \frac{\sum_i y_i y_{i+l}}{\sum_i y_i^2}$$

is the lag-$l$ autocorrelation estimate. The functional $T_{W,L} = T_{W,L}(\mu_y)$ associated
with $V_n^2$ is

$$T_{W,L} = \sum_{l=1}^{L} \rho_l^2,$$

where

$$\rho_l = \frac{E\,y_1 y_{1+l}}{E\,y_1^2}$$

is the lag-$l$ autocorrelation.
Under the null hypothesis that $\mu_y = \mu_y^0$ is a white-noise measure, $nV_n^2$ converges
in law to a chi-squared distribution $\chi_L^2$ with L degrees of freedom. In this
case it is easy to check that for the general replacement model (2.2)

$$\mathrm{IF}(\mu_w, T_{W,L}, \{\mu_y^\gamma\}) = 0$$

for any $\mu_w$ and any arc $\{\mu_y^\gamma\}$ such that $\mu_y^\gamma \to \mu_y^0$ as $\gamma \to 0$. This is the usual von
Mises expansion type of result for a statistic having an asymptotic chi-squared
distribution. One therefore expects a nonvanishing second derivative, and hence
we define the second-order influence functional for such cases as

$$\mathrm{IF}^{(2)}(\mu_w, T_{W,L}, \{\mu_y^\gamma\}) = (\partial^2/\partial\gamma^2)\, T_{W,L}(\mu_y^\gamma)\big|_{\gamma=0}.$$

For contaminated processes $\mu_y^\gamma$ with patches of length k, one finds that

$$\mathrm{IF}^{(2)}(\mu_w, T_{W,L}, \{\mu_y^\gamma\}) = \frac{2}{k^2\sigma_x^4} \sum_{l=1}^{L} \bigl[\min(k, l)\,(E x_1 w_{1-l} + E w_1 x_{1-l}) + \max(k - l, 0)\, E w_1 w_{1-l}\bigr]^2.$$

When $x_i$ is independent of $w_{i-l}$, $|l| = 1, 2, \ldots$, this expression reduces to

$$\mathrm{IF}^{(2)}(\mu_w, T_{W,L}, \{\mu_y^\gamma\}) = \frac{2}{k^2\sigma_x^4} \sum_{l=1}^{\min(L,\, k-1)} (k - l)^2\, (E w_1 w_{1-l})^2.$$

In the AO case where $w_i = x_i + v_i$, with $x_i$ white noise and $v_i$ independent of $x_i$,
we have

$$E w_1 w_{1-l} = E v_1 v_{1-l}, \qquad l \ge 1.$$

If $v_i \equiv \xi$ this gives the influence curves

$$\mathrm{IC}_k^{(2)}(\xi) = \frac{2\xi^4}{\sigma_x^4 k^2} \sum_{l=1}^{\min(L,\, k-1)} (k - l)^2, \qquad k = 1, 2, \ldots.$$

We have $\mathrm{IC}_1^{(2)}(\xi) = 0$ as expected, and $\mathrm{IC}_k^{(2)}$ increases with k until k = L + 1,
as is intuitively reasonable.
Of course, since $\mathrm{IC}^{(2)}$ is unbounded, the test statistic $V_n^2$ is not robust, and one
is motivated to find a robust alternative.
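The non-robustness of the whiteness statistic is easy to reproduce. The sketch below is illustrative only: it computes $n$ times the sum of squared lag-1 to lag-$L$ autocorrelation estimates for a simulated Gaussian white-noise series, and again after one additive patch has been inserted. The series length, seed, patch position, patch length, and patch amplitude are assumptions, and the function names are hypothetical.

```python
import random

def autocorr(y, lag):
    """Lag-`lag` autocorrelation estimate for a zero-mean series."""
    num = sum(y[i] * y[i + lag] for i in range(len(y) - lag))
    den = sum(v * v for v in y)
    return num / den

def whiteness_stat(y, L):
    """Sum of squared autocorrelation estimates at lags 1..L."""
    return sum(autocorr(y, lag) ** 2 for lag in range(1, L + 1))

rng = random.Random(1)          # arbitrary fixed seed
n, L = 500, 5
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

y = list(x)
for i in range(200, 210):       # one additive patch of length k = 10, amplitude 8
    y[i] += 8.0

v_clean = n * whiteness_stat(x, L)   # roughly chi-squared with L degrees of freedom
v_patch = n * whiteness_stat(y, L)   # inflated by the patch
print(v_clean, v_patch)
```

A single patch of ten observations out of 500 is enough to push the statistic far into the rejection region, even though the underlying process is white noise.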


COMMENT 8.1. Since the same test can be obtained using different test 
statistics which have different influence curves, it is necessary to standardize a 
test statistic in order to make fair comparisons in terms of influence curves. This 
can be done for example as described by Ronchetti (1982) and Lambert (1981). 
Similar standardizations should be considered in order to convert $\mathrm{IF}^{(2)}$ to an
influence curve for tests.


8.2. Spectral density estimates. Let $S(f) = S(f, \mu_y)$ denote the spectral
density functional of a stationary, zero mean process $y_i$ with measure $\mu_y$. Of
course $S(f)$ in fact depends only upon second-order properties of $y_i$. It is
common practice to estimate $S(f)$ with a smoothed version $\hat S_n(f)$ of the
periodogram based on “tapered” data, smoothing being needed to obtain
consistency. Such estimates may be written as

$$\hat S_n(f) = \sum_{l=-n}^{n} w_n(l)\, \hat R_n(l)\, e^{-i 2\pi f l},$$

where $w_n(l)$ is an appropriate “lag window” and

$$\hat R_n(l) = \frac{1}{n} \sum_{i=1}^{n-|l|} (y_i - \bar y)(y_{i+|l|} - \bar y)$$

is an estimate of the lag-$l$ covariance $R(l) = R(l, \mu_y)$ of the process $y_i$. Under
certain conditions $\hat R_n(l)$ is a consistent estimate of $R(l)$, $l = 0, 1, \ldots$, and $\hat S_n(f)$
is a consistent estimate of $S(f)$. Thus, the functional associated with $\hat S_n(f)$ is

$$S(f) = \sum_{l=-\infty}^{\infty} R(l)\, e^{-i 2\pi f l}.$$

The spectral density $S(f)$ is an infinite-dimensional parameter and we get an
infinite-dimensional influence functional $\mathrm{IF}(\mu_w, S, \{\mu_y^\gamma\}) = \mathrm{IF}(\mu_w, f, S, \{\mu_y^\gamma\})$
through the pointwise definition:

$$\mathrm{IF}(\mu_w, f, S, \{\mu_y^\gamma\}) = \lim_{\gamma \to 0} \frac{S(f, \mu_y^\gamma) - S(f, \mu_y^0)}{\gamma}.$$

The abbreviated notation $\mathrm{IF}_S(\mu_w, f)$ will be convenient. Denote the influence
functional for the lag-$l$ covariance $\mathrm{IF}_R(\mu_w, l) = \mathrm{IF}(\mu_w, R(l), \{\mu_y^\gamma\})$. Then

$$\mathrm{IF}_S(\mu_w, f) = \sum_{l=-\infty}^{\infty} \mathrm{IF}_R(\mu_w, l)\, e^{-i 2\pi f l}.$$

For the AO model with independent outliers of amplitude $\xi$, the influence curve
for $R(l)$ is

$$\mathrm{IC}_R(\xi, l) = \begin{cases} \xi^2, & l = 0, \\ 0, & l \ne 0, \end{cases}$$

which gives the spectral density influence curve

$$\mathrm{IC}_S(\xi, f) = \xi^2, \qquad f \in [-\tfrac12, \tfrac12].$$

As one might expect, the influence curve in this case is quadratically unbounded,
and independent of f. Other kinds of contamination, such as constant-level
patches, sinusoidal patches, or patches with nonflat spectral structure, will lead
to influence functionals and influence curves for $S(f)$ which depend on f in a
nontrivial way, and are quadratically unbounded in the amplitude of
contamination. It will be interesting to compute $\mathrm{IC}_S$ for these and other types of
contaminations motivated by applications.
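The flat, frequency-independent shift produced by isolated additive outliers can be illustrated with a simple lag-window estimate. The sketch below is an assumption-laden illustration, not the tapered estimate discussed above: it uses a Bartlett (triangular) lag window with truncation point `M`, the real cosine form of the Fourier sum (valid because the sample autocovariance is symmetric in the lag), frequency in cycles per sample, and 100 outliers of size `xi` placed at randomly sampled positions as a stand-in for independent outliers.

```python
import math
import random

def sample_autocov(y, max_lag):
    """Biased sample autocovariances at lags 0..max_lag (mean-corrected, divisor n)."""
    n = len(y)
    mean = sum(y) / n
    d = [v - mean for v in y]
    return [sum(d[i] * d[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]

def lag_window_spectrum(acov, f):
    """Bartlett lag-window estimate at frequency f, with weights 1 - lag/M, M = len(acov)."""
    M = len(acov)
    s = acov[0]
    for lag in range(1, M):
        s += 2.0 * (1.0 - lag / M) * acov[lag] * math.cos(2.0 * math.pi * f * lag)
    return s

rng = random.Random(2)                 # arbitrary fixed seed
n, M, xi = 2000, 20, 10.0
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = list(x)
for i in rng.sample(range(n), 100):    # 5% of positions receive an additive outlier
    y[i] += xi

acov_x = sample_autocov(x, M - 1)
acov_y = sample_autocov(y, M - 1)
freqs = [0.1, 0.25, 0.4]
shift = [lag_window_spectrum(acov_y, f) - lag_window_spectrum(acov_x, f) for f in freqs]
print(shift)                           # each entry is near 0.05 * 0.95 * xi**2 = 4.75
```

The shift is roughly the same at every frequency, which is the behavior the spectral density influence curve for independent outliers predicts.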

Robust alternatives to the smoothed periodogram estimates have been pro- 
posed in Kleiner, Martin, and Thomson (1979) and Martin and Thomson (1982). 
We intend to study these estimates in terms of their influence functionals and 
curves, which will probably have to be computed via simulation. 


9. Proofs of theorems.

PROOF OF THEOREM 4.1. According to the definition of a $\Psi$ estimate

$$m(\gamma, T(\mu_y^\gamma)) = 0.$$

By assumption (a) there exists $\gamma_0 > 0$ such that $|T(\mu_y^\gamma) - t_0| < \varepsilon$ for $\gamma < \gamma_0$,
where $\varepsilon$ is as in assumption (b). Therefore by (b) and the mean-value theorem we
have for $\gamma < \min(\gamma_0, \varepsilon)$,

$$m(\gamma, t_0) + D(\gamma, t^*(\gamma))\,\bigl(T(\mu_y^\gamma) - T(\mu_x)\bigr) = 0$$

and, by assumption (a), $t^*(\gamma) \to t_0$ as $\gamma \to 0$. Then by assumption (b),
$D(\gamma, t^*(\gamma)) \to C = D(0, t_0)$ as $\gamma \to 0$. Then (4.6) follows. □


PROOF OF THEOREM 4.2. (i) Patches of length k are generated by letting 
zp = max( ŽP, Žf 1,..., Žika 1) 


where the ZP are iid. Bernoulli random variables with P(Z? = 1) = p, and 




y = kp, with k fixed. For fixed m > 1, put 


T m-l 
(9.1) Chr = {zp,=1}9| N (22,=0}|, Osj<m-1, 
Ae 
m—i 
(9.2) cè = N (22,=0}, 
=0 
m-i 
(9.3) Che m= U [fat = 1} ^ {z =1)}] 
tj 
i*J 


Then for 0 <7 < m — 1 we have 


(9.4) P(C}P) =p + d, „(p) =p + o(p). 
We also have 

(9.5) P(Ch?) =1+4 d; (Pp) =1+ O(p) 
and 

(9.6) P(C m) = ds, m( P) = o(p). 
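The patch mechanism used at the start of this proof, in which the contamination indicator at time i is the maximum of k consecutive i.i.d. Bernoulli(p) variables, can be sketched as follows. The function names are hypothetical, and indices before the start of the series are treated as zero, which is an assumption made only to keep the example finite.

```python
def patch_indicators(z_tilde, k):
    """z_i = max(z~_i, z~_{i-1}, ..., z~_{i-k+1}); missing early terms count as 0."""
    return [max(z_tilde[max(0, i - k + 1): i + 1]) for i in range(len(z_tilde))]

def contamination_fraction(p, k):
    """P(z_i = 1) = 1 - (1 - p)^k, which is k*p up to terms of order p^2."""
    return 1.0 - (1.0 - p) ** k

# Each 1 in the driving sequence starts a patch of length k = 3.
z = patch_indicators([0, 0, 1, 0, 0, 0, 1, 0], 3)
print(z)
print(contamination_fraction(0.01, 5))
```

For small p the fraction of contaminated points is close to kp, consistent with setting $\gamma = kp$ with k fixed.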


According to the definition of T we have 
E¥(yz,T(ny)) = 0. 
Then 
E¥(y},to) + E[¥(y7,T(u3)) - ¥(v7 to)] = 0. 
Since C*?,0 <j < m + 1, isa partition of the sample space of (2?, 2%, ..., 2? m) 


we have 

(9.7) EW(y},to) + È elm, p)P(C}2) =0, 
j=0 

where 

(9.8) e,(m, p) = E[¥(y7,T(u)) - (y7, to) IGA? 


For j 2 0, let y, , = (1, j Yọ p- -) be given by 


(wi,x_,), O<jsk-1, 
(9.9) Yi, 7 l1-J+k 1-) x 
(x! Wh Xj) J2 k. 
Conditioned on C},?, for O<j<m-1 we have y}_,=5,_,,, For 0<i< 
m — k and for i > m — k, yj_, is either x,_, or w,_,. Thus we have 


(9.10) |E[¥(y7,t)ICh?] - E¥(y,,,t)|< Halt), O<jsm-1. 




Then 
le,(m, p)| < |E|¥(y,, ,,T(43)) = ¥(y,, ,,to)| | 


+A ks (T(u%)) + Hy psilto)- 


Therefore by assumptions (a) and (e), and the dominated convergence theorem, 
we get 


(m, p)| s b (m, p) + Hp-ra(T(8)) + Hm-r+i(to), 


(9.11) 

0<jsm-1, 
where 
(9.12) lim b,(m, p)=0. 
We will now show that for |t — to| < £, where e, is as in assumption (c), we have 
(9.13) Z[¥O7,HIC2] -E¥@,t)|<p E Hit). 

t=m—k+1 

We have 


E[¥(y7,t)iCh2] = (1 - p)E[ YO, HIG, 2P m = 0] 
+pE|¥(y7,t)ICh A, f-m = 1]. 
Since conditionally on C}:?,, yf_,=2x,_,0<i<m— k, we have 
|E [E(y7,t)CE 2, ZP m = 1] - E¥(x,,t)| < A, 2,,(t). 
Then 
|E[¥(y7, Hick: A] - E¥(x,,1)| 
< (1~ p)|E [F(Y], CRA, mai] — B¥(x:, t)| 
+plE[P(y?, tiC*:?, 2?_,, = 1] - E¥(x,,t)| 
S|E[V(y},t)ICE 4, mar] — E(x, t)|+ PA —ai(t). 
Iterating this relationship, we get 
|E[¥(yy,t)|C22.] - E¥(x,,t)| 


m-k+h+1 


j s P 2 H,(t) + |E[¥(y7, HCPA ninal an; E¥(x,,t) |. 


t=am—k+1 
Since the second term of the right-hand side is no greater than H,,,,_44(t), 
which by (c) tends to 0 for |t — to] < eo, we get (9.13). Using (9.13) we get 


le,(™, p)- E|&(x,,T(ut)) z ¥(x,,t))| | 


<p ¥ [H,(T(ut)) + H,(to)]. 


t—=m—k+1 


Then according to (a) and (b) we get 
|c,,(m, p) ~ C[T(ut) - T(n,)]| 


9.14 2 

oo sa(p)+p_ X ‘LA(tO3)) + H,(to)], 
where 

(9.15) a(p) = o(p). 

We also have 

(9.16) len +i(™, p)| < Hy(T(ut)) + Ho(to). 


Using (2.5), (9.4)-(9.7), (9.11)—(9.12), and (9.14)-(9.16), along with assumptions 
(a), (c), and (d), straightforward computations give (see Martin and Yohai, 1984a, 
for details): 


EW(y},t))  C[T(n?) = TOJ 
+ 
g(y) gly) 
and thereby (4.6) follows. 
(ii) Using (9.1), (9.2), and (9.3), we have 


E¥(y),t,) = p “E[#(y7, +.) he] PGA). 





O(1) 


Then using (9.4), (9.5), and (9.6) we get 








EẸ(y?, to) 18 

wy) TAS 
+d, ,(P) 

ay |z[¥(y7,t,)iC4?] - G| 

(9.17) Hi- zo Eeti E iG* 
1+d, ,,(P) ee 
— Elto sty) [C22] 
d3, ml P) p 


POR E|[%(y? sto )iCh mE: ak 


Since G* = EW(y,, ,, to), 0 <j < m — 1, where y, , is given by (9.9), iaeoa 
gives 


(9.18) |E[E7,to)CtR] - G| < H, gii(to), Osj<sm-1. 




We also have 
(9.19) E[¥(y7,to)IC27, m] < Halto). 


Since E¥(x,,t,) = 0, and since x, and Yı, have the same first 7-k+1 
components when j > k, we have 


(9.20) IG <s H,_p4i(to): J 2 k, 
and by the definitions of G* and H,(t) we have 
(9.20) IG} < H(t), O<j<sk-1i. 


Conditioned on C% ?,, x, and y? have the same first m — k + 1 components, and 
so (9.13) gives 


oO 
(9.21) |E [Ey to) Cka] se E (to). 
t=m—k+1 
Using (2.5), (9.4)-(9.6), and (9.18)—(9.21), along with assumptions (a), (c), and (d), 
we get (4.8) by straightforward computation (again, details may be found in 
Martin and Yohai, 1984a). 0 


PROOF OF THEOREM 4.3. By (4.6), (b), (c), and the dominated convergence
theorem, we have


(9.22) TP( 14a, (H5}) = lim €- pena 

Note that 

(9.22) Eg (y? to) = E[%(x:, ,0)124,1, = 0](1 - e70) 
+E[n,(yi 1,to)l21,1, + 0] e*(7), 


where g*(y) = P(z], 1, + 0). Also 
OF En,(x,,1,to) = E|a,(x,,,,to)|27, ee o((2 - g3(y)) 


+E [m(x 1, to)iz7, z + ol g*(7) 


Then we have 


& 
|E [(x:,1.t0)l22,1, = 0||< i pa 
Since g¥(y) < hg(y), using (9.22’) we have 
[Em (y} 1 to)| < K o| Say +1] 


for sufficiently small y, and so (2.1) and (9.22) give (4.9). O 




PROOF OF THEOREM 5.1. (i) For the GM estimates, straightforward
computations show that


ð 
(9.23) C= E) "u py 
For the RA estimates 
B 
(9.24) dare 


and the result (i) follows from (9.23) and (9.24).
(ii) The result is proved in Bustos (1982) for the GM estimates, and in Bustos, 
Fraiman, and Yohai (1984) for RA estimates. O 


Proor oF THEOREM 5.2. Using (A1)—(A4) and E|v,|":**: < oo, it is easy to 
show that there exist solutions 26M = T°M(u%) and tP^ = TPA(yY) of 


En( y? - tx, = t?)'”? yz) =0 
and 
o0 
È en yf — tg, Wt; 1) = 0, 
j=l 
respectively, such that <M > p and ¿P^ — @ as y — 0. Then assumption (a) of 


Theorem 4.2 is satisfied. It is easy to check that the other assumptions of 
Theorem 4.2 are also satisfied, and so TS and T™ satisfy (4.8). 


Let y,, = (91, y Yo, j---) be given by 


(xiv +vi7,x_,), fO0<j<k-1, 
MV gute tv), jak. 
Then 
= E¥(y,,,, $). 
For GM estimates Ẹ(y,, $) = n(71 — $% Yo(l — $”)'/), and so we get 
0, J=0, 
Gra En(u, + 0; — $09, (Xo + vo)(1 — gy"), Jsk-1, 
i En(u, -= $0, (£o + %)(1 — ¢7)'”), j=k, 
0, j> k. 


Therefore by (4.8) and (9.23) we have 


(1-9*)'” 
TF ao, rtp, T™, $) = — B 


which gives (5.8), with (5.6) as a special case. 


[(k - 1)G* + G$], 




Now for TPA, d(y,, $) is given by (3.13), and so 


fe a) 
ate Reet, 
hel 
with 
Gi, > Enl y; = Yo, j Yi-h,j ns ®¥_n,)) 


Using the assumptions on 7 gives 


0, h>j, 
0, J >k, 
Gh En(u, — $09, Uia + 01-4), h=j=k, 
Dho En(u, — $09, Uia + 21-1 — PPa), h>j=k, 
En(u, + 0, — tos li-a + ti-a) h=j<k, 


En(u, + 0, — $0),U,_,+0,_-,—- ¢0_,), A<j<k. 


Therefore by (4.8) and (9.24) we have 





1— g fk-2 i 
IF(p,, T™, 9) = E yeh 1)o" GE aa +h ¢" G$ i, k-1 
h=1 h=1 


k-1 
+ L ¢'Gk at fatal 
h=1 


This gives (5.9), with (5.7) as a special case. O 


PROOF OF THEOREM 5.3. (i) Straightforward computations show that for
$T^{GM}$ we have $C = B/(1 - \theta^2)^{1/2}$, and for $T^{RA}$ we have $C = B/(1 - \theta^2)$. Then
(5.10) and (5.11) follow.

(ii) The result is proved in Bustos, Fraiman, and Yohai (1984) for RA esti- 
mates. For GM estimates, it can be proved similarly. O 


PRoor OF THEOREM 5.4. (A1)-(A4) and E|y,|"**: < oo imply as in Theo- 
rem 5.2 the existence of solutions T°M(u%) and T®4(u%) which converge to the 
true parameter 6 as y > 0. Then assumption (a) of Theorem 4.2 holds. We will 
show that assumption (c) is satisfied. We will assume h, = h, = 0, but the proof 
in the other cases is similar. 

For GM estimates we have 


TE | Eon (1-02)? F 04 Dey). 


1=0 =0 


Straightforward computations show that 


or [m0r 0” 
H„(0) < zem” a + | d a + aay fajo -— “| 










where K is a bound on the partial derivatives of 4, and M = E|x,| + E|o,|. 
Therefore assumption (c) is satisfied for T°™, For the RA estimates 


æ æ 00 
vy. 6) = > 67" 'y L bY L 6'yy_ 5-4 ’ 
j=l 1=0 i=0 
and straightforward computations show that 
2KM\0\" 2KMmļ6 =- 2KMj6|™ 
a-w 1 — 8 


Thus assumption (c) of Theorem 4.2 is satisfied for T P^. It is easy to check the 
remaining assumptions of Theorem 4.2 for both T°™ and TP^, Therefore (4.8) 
holds for these estimates. 

Now we need to evaluate (4.8) for TCM and T®4. For GM estimates 


G} a En(u, + Ov, (1 Eg 02)? (89 + j0/-v,_,)). 


Since (1 — 6*)'/*s, is N(0, 1) and independent of u,, and C = B/(1 — 67)'/?, (4.8) 
gives (5.13). For RA estimates we have 


H,,(9) < 


TE 


J 
G! = È 0% Ey(u, + 0%, ua + Oop) 


J 
hal 
+ È o*Ey(u, + bw,» uza). 
h=j+1 
By (Al), En(u, + 04,_,, u,_,) = 0. Since C = B/(1 — 0°), interchanging the 
order of summations gives (5.14). 0 


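PROOF OF THEOREM 7.1. (i) It is easy to check that when $y_i \equiv x_i$ is an
AR(1) process, the $V_l$ in (4.4) are zero for $l > 0$. For $\psi$ satisfying Theorem 4.2 we
have

$$\mathrm{ICH}(\mathbf{y}_1) = \mathrm{ICH}(\mathbf{x}_1, \phi) = C^{-1}\psi(\mathbf{x}_1, \phi)$$

with $C = C_\psi$ given in (4.2'), and so from (4.4) we have

$$V(\psi) = C_\psi^{-2}\, E\psi^2(\mathbf{x}_1, \phi).$$

Our proof consists of showing that $C_\psi = C_{\psi^*}$, and that $E\psi^{*2}(\mathbf{x}_1, \phi) \le E\psi^2(\mathbf{x}_1, \phi)$,
which implies $V(\psi^*) \le V(\psi)$. We first show that

$$E\psi^{*2}(\mathbf{x}_1, \phi) \le E\psi^2(\mathbf{x}_1, \phi). \tag{9.25}$$

In order to prove (9.25) it is enough to show that

$$E\bigl[\psi(\mathbf{x}_1, \phi) - \psi^*(\mathbf{x}_1, \phi)\bigr]\,\psi^*(\mathbf{x}_1, \phi) = 0.$$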
Let 
u, = u,($) = (t, Uo, u-i...) = (U,(¢), Uo(), u_1(¢),---)- 




Then in view of the definition of ¥* and ¥*, it suffices to show that 
(9.26) Ey*(x,, o)n,(4,, li- $) = Ey*(ua,, o)7,(u,,U,_,, >), 
i= 1,2,... 
But 
Eğ*(u,, ġ)n, (u, Ui- $) = En, (u, Ui- $)E [4+ (u, ọ)lu, ui] 
= Eni(u,,u,_,,9), 

and since (C3) implies 

En, (u, ui $)n (i, u, $) = 0, i# J, 
we have 

Ey*(x,, )n,( uy, Uy 4, o)= Eni (u, Uii $), 
from which (9.26) follows. 

Now we will establish that C; = C;., that is, 
EDY(x,, $) = ED{*(x,, $), 

where D = 0/d¢. We have 


D(x, $) = — È ni (wo), 9) + Ji (ule), $), 


where 
bi (a), az... $) = (3/3a,) Y+ (ay, azs...) 
and 
Jila, az,..., $) = (3/3) (a, dg,..., 6). 
By (C2) we have 
Ex,_*(u,,¢)=0, i>l, 
and 
Ey; (u,, >) = 0. 
Therefore, 
ED#(x,, $) S —Exob{(u,, $) zo L ¢ 'Eu, ý] (u, >). 
w=1 
Since (C1) gives 
i Eu, i (u, üg,..., $) = Eu, E [Ẹ} (u, üo...» $) |i, u] 
P Eu, _,(3/3u,)n, (u, Ui- ), 
we have 
ED¢(x,, p) aig 2 ¢'Eu,_,(0/du,)n,(u,, Ui- $). 
t=1 


Now similar reasoning gives exactly the same expression for EDj*(x,, $), and 
the proof of (i) is complete. 




(ii) By Theorem 4.2 we have 


1 œ 
TFao, (Hos T, >) = TE L G}, 








1=0 
where 
C= ED¥(x,, $), 
Gp = Ed*(u, + 0,, Uo, U_y,---,@), 
and for i > 1, 
G! = Edt (uy,..., Ug_,) Uo, — $01_,, Uy_, FOU pens G). 
(C2) implies G} = 0 and (C8) implies G! = 0 for i > 2. Therefore 


1 


1 
gs = — Gen, — Vo, Ug + Uo, $). 


TF yo, 1(Ho> T, $) = 


It is easy to check that for T * 


qi ={% . , 
i En,(u, — $to, Uo + o$), t=Oandi>2 


and so one also has 


1 
TFyo (Ho T*,o) = - Gena — $09, Uy + vo, È). oO 


Acknowledgments. The authors appreciate the help of Judith Zeh, who 
wrote the code for computing the influence curves displayed in Figures 1-3. She 
also prepared the final plots, using the interactive statistical language and system 
S (Becker and Chambers, 1984). We wish to thank the Institute of Pure and 
Applied Mathematics (IMPA) in Rio de Janeiro for their generous support for 
two months of 1983, when this research was in its incipient stages. We also would 
like to thank the referees of the paper for very thorough and illuminating 
reviews. 


REFERENCES 


ANDREWS, D. F., BICKEL, P. J., HAMPEL, F. R., HUBER, P. J., ROGERS, W. H. and TUKEY, J. W.
(1972). Robust Estimates of Location: Survey and Advances. Princeton Univ. Press,
Princeton, N.J.

BECKER, R. A. and CHAMBERS, J. M. (1984). S: An Interactive Environment for Data Analysis and
Graphics. Wadsworth, Belmont, Calif.

BUSTOS, O. (1982). General M-estimates for contaminated pth-order autoregressive processes:
consistency and asymptotic normality. Z. Wahrsch. verw. Gebiete 59 491-504.

BUSTOS, O., FRAIMAN, R. and YOHAI, V. J. (1984). Asymptotic behavior of the estimates based on
residual autocovariances for ARMA models. In Robust and Nonlinear Time Series
Analysis (J. Franke, W. Härdle, and D. Martin, eds.) 26-49. Springer, New York.

BUSTOS, O. and YOHAI, V. J. (1986). Robust estimates for ARMA models. J. Amer. Statist. Assoc.
81 155-168.

CHERNICK, M. R., DOWNING, D. J. and PIKE, D. H. (1982). Detecting outliers in time series data. J.
Amer. Statist. Assoc. 77 743-747.

DENBY, L. and MARTIN, R. D. (1979). Robust estimation of the first order autoregressive parameter.
J. Amer. Statist. Assoc. 74 140-146.

HAMPEL, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist.
Assoc. 69 383-393.

HAMPEL, F. R. (1975). Beyond location parameters: robust concepts and methods. Bull. Internat.
Statist. Inst. 40(1) 375-382.

HAMPEL, F. R. (1978). Optimally bounding the gross-error-sensitivity and the influence of position in
factor space. Proc. ASA Statist. Computing Sec. 59-64. Amer. Statist. Assoc.,
Washington.

HUBER, P. (1981). Robust Statistics. Wiley, New York.

HUBER, P. J. (1983). Minimax aspects of bounded influence regression. J. Amer. Statist. Assoc. 78
66-72.

JAMES, B. R. and JAMES, K. L. (1983). On the influence curve for quantal bioassay. J. Statist.
Plann. Inference 8 331-346.

KELLY, G. E. (1984). The influence function in the errors in variables problem. Ann. Statist. 12
87-100.

KLEINER, B., MARTIN, R. D. and THOMSON, D. J. (1979). Robust estimation of power spectra. J.
Roy. Statist. Soc. Ser. B 41 313-351.

KRASKER, W. S. and WELSCH, R. E. (1982). Efficient bounded-influence regression estimation. J.
Amer. Statist. Assoc. 77 595-604.

KÜNSCH, H. (1984). Infinitesimal robustness for autoregressive processes. Ann. Statist. 12 843-863.

LAMBERT, D. (1981). Influence functions for testing. J. Amer. Statist. Assoc. 76 649-657.

MALLOWS, C. L. (1976). On some topics in robustness. Bell Labs Technical Memo, Murray Hill, N.J.
(Talks given at NBER Workshop on Robust Regression, Cambridge, Mass., May 1973,
and at ASA-IMS Regional Meeting, Rochester, N.Y., May 1975.)

MARTIN, R. D. (1980). Robust estimation of autoregressive models (with discussion). In Directions in
Time Series (D. R. Brillinger and G. C. Tiao, eds.) 228-262. IMS, Hayward, Calif.

MARTIN, R. D. (1981). Robust methods for time series. In Applied Time Series Analysis II (D. F.
Findley, ed.) 683-759. Academic, New York.

MARTIN, R. D. (1982). The Cramér-Rao bound and robust M-estimates for autoregressions.
Biometrika 69 437-442.

MARTIN, R. D. and JONG, J. (1977). Asymptotic properties of robust generalized M-estimates for the
first-order autoregressive parameter. Bell Labs Technical Memo, Murray Hill, N.J.

MARTIN, R. D. and THOMSON, D. J. (1982). Robust-resistant spectrum estimation. Proc. IEEE 70
1097-1115.

MARTIN, R. D. and YOHAI, V. J. (1984a). Influence curves for time series. Technical Report No. 51,
Dept. Statistics, Univ. of Washington, Seattle.

MARTIN, R. D. and YOHAI, V. J. (1984b). Gross-error sensitivities of GM and RA-estimates. In
Robust and Nonlinear Time Series Analysis (J. Franke, W. Härdle and D. Martin, eds.)
198-217. Springer, New York.

MARTIN, R. D. and YOHAI, V. J. (1985). Robustness in time series and estimating ARMA models. In
Handbook of Statistics 5 (E. J. Hannan, P. R. Krishnaiah and M. M. Rao, eds.) 119-155.
Elsevier, New York.

MICHAEL, J. R. and SCHUCANY, W. R. (1985). The influence curve and goodness of fit. J. Amer.
Statist. Assoc. 80 678-682.

PORTNOY, S. L. (1977). Robust estimation in dependent situations. Ann. Statist. 5 22-43.

RONCHETTI, E. (1982). Robust testing in linear models: the infinitesimal approach. Ph.D.
dissertation, ETH, Zurich.

SAMUELS, S. J. (1978). Robustness of survival estimators. Ph.D. dissertation, Dept. Biostatistics,
Univ. of Washington, Seattle.


DEPARTMENT OF MATHEMATICS 
UNIVERSITY OF BUENOS AIRES 
CIUDAD UNIVERSITARIA, PABELLON 1
1428 BUENOS AIRES 

ARGENTINA 


DEPARTMENT OF STATISTICS, GN-22 
UNIVERSITY OF WASHINGTON 
SEATTLE, WASHINGTON 98195 


DISCUSSION 


DAVID R. BRILLINGER¹
University of California, Berkeley 


Professors Martin and Yohai are to be complimented for their topical, 
thoughtful paper. In the paper they have emphasized population aspects of the 
material. In my discussion I will emphasize the data side. The two sides are both 
complementary and intersecting. 

There is a circle of interrelated ideas: influence, sensitivity, deletion, resis- 
tance, leverage, robustness, and jackknifing. Work appears to progress on all of 
these fronts more or less simultaneously with algorithmic and computing ad- 
vances often providing exogenous impetus. I will present a data analysis made 
possible by some contemporary time series methodology and easy availability of 
minicomputers. 

The concern of Professors Martin and Yohai is to extend the concepts and 
methods of “influence” to the time series case. They proceed by examining the 
effects of contaminating the data, by studying for example gross-error sensitivity. 
In the i.i.d. case an immediate way to study the influence of a possibly incorrect 
data point is to delete it and to carry through the inference procedure for both 
the full and depleted data sets. Because of the invariance of the structure under 
permutations of the data, in the i.i.d. case ways forward are clear; however, as 
Professors Martin and Yohai emphasize, the permutation invariance is not 
generally present in the time series case. There is, however, a way to retain the 
full time series structure and still do deletion/jackknife-type studies.

A long time ago (Brillinger, 1966) I suggested that a way to develop jackknife 
procedures for complex situations was to apply a missing-value technique. Briefly,
on deleting an observation one acts as if the data consist of what they are,
except that the deleted observation is missing. Luckily, nowadays we have many conceptual
and methodological means for handling data with missing values. A way
forward for studying the influence of individual observations in a variety of time 
series situations is now clear. In that connection it may be remarked that the 
procedure is a form of sensitivity analysis. Namely one is studying the effect of 
altering an observation to its “best” estimate based on the remaining data in 
some sense. 

Resulting from the work of Ansley and Kohn (1984), Harvey and McKenzie 
(1984), Jones (1984), and Shumway (1984) there are a variety of methods to fit 
finite parameter (ARMA) models to discrete time series data having some missing 
values. In the calculations to be presented, the method of Jones (1980) was 
employed. Figure 1 is a graph of the logarithm of the Mackenzie River series of 
annual Canadian lynx trappings for the years 1821-1934. These data are studied, 
and much discussed, in Campbell and Walker (1977) and Tong (1977) for 


¹Research partially supported by the National Science Foundation Grant DMS-8316634.


FIG. 1. Logarithm of Canadian lynx counts plotted against year, 1820 to 1940.


example. Taking note of Tong’s fit of an ARMA(3, 3) model to this series, that 
was the principal model that I worked with. The method of fit was maximum 
likelihood, assuming the process to be Gaussian. Figure 2 gives the residuals 
(difference between observed and predicted) for the ARMA(3,3). There are 
indications of lack of fit (indeed Tong (1983) goes on to fit nonlinear models); 
however, this model has sopped up a lot of the variation. 

The ARMA(3,3) model was next fitted to the data, by maximizing the 
likelihood, dropping out each of the 114 observations in turn. Six coefficients and 


FIG. 2. Residuals from ARMA(3, 3), plotted against year.

FIG. 3. Coefficients, first principal component: deletion values, plotted against year.


the innovation variance were estimated each time. (A program for fitting ARMA’s 
with missing data was run each time. It is clear that an algorithm could be 
developed to reduce the computations involved, as is the situation in the i.i.d.
case.) The coefficient estimates obtained were highly correlated. Rather than 
presenting six pictures of them, a principal component analysis of them was 
carried out. Figure 3 provides an (index) plot of the first principal component 
value versus the year of the deleted observation. A number of cases are seen to 
stand out, i.e., be apparently influential. In part these cases seem to correspond to
“kinks” in the original series. 
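
The deletion procedure described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the program used for the lynx analysis: it uses a Gaussian AR(1) model instead of the ARMA(3, 3), for which the exact likelihood with missing values has a closed form, and a crude grid search instead of a proper optimizer. The function names `ar1_loglik`, `fit_phi`, and `deletion_values` are hypothetical.

```python
import math


def ar1_loglik(phi, series):
    """Concentrated Gaussian AR(1) log-likelihood (up to a constant).

    `None` marks a missing value. Because AR(1) is Markov, an observation at
    lag d from the previous observed one has conditional mean phi**d * x_prev
    and variance sigma^2 * (1 - phi**(2d)) / (1 - phi**2); sigma^2 is profiled out.
    """
    obs = [(t, x) for t, x in enumerate(series) if x is not None]
    n = len(obs)
    v = 1.0 / (1.0 - phi * phi)          # stationary variance factor, first point
    ss = obs[0][1] ** 2 / v
    logdet = math.log(v)
    for (s, xs), (t, xt) in zip(obs, obs[1:]):
        d = t - s                        # d > 1 when values in between are missing
        v = (1.0 - phi ** (2 * d)) / (1.0 - phi * phi)
        ss += (xt - phi ** d * xs) ** 2 / v
        logdet += math.log(v)
    return -0.5 * n * math.log(ss / n) - 0.5 * logdet


def fit_phi(series):
    """Maximize the likelihood over a grid of phi values in (-1, 1)."""
    candidates = [i / 100.0 for i in range(-99, 100)]
    return max(candidates, key=lambda p: ar1_loglik(p, series))


def deletion_values(series):
    """Refit with each observation in turn treated as missing.

    Returns the full-data estimate and the list of changes in the estimate.
    """
    full = fit_phi(series)
    changes = []
    for i in range(len(series)):
        held_out = list(series)
        held_out[i] = None               # treat observation i as missing
        changes.append(fit_phi(held_out) - full)
    return full, changes
```

An index plot of the returned changes against time plays the role of Figure 3; with several coefficients one would first reduce them to a principal component, as in the text.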

Professors Martin and Yohai have discussed bounding the influence of individ- 
ual cases. This is a natural next step in the lynx data analysis, so I end by asking 
the authors how they would recommend fitting an ARMA in a bounded influence 
manner for the individual point case? 


REFERENCES 


ANSLEY, C. and KOHN, R. (1984). On the estimation of ARIMA models with missing values. In Time
Series Analysis of Irregularly Observed Data. Lecture Notes in Statist. (E. Parzen, ed.) 25
9-37. Springer, New York.

BRILLINGER, D. R. (1966). Discussion of Mr. Sprent’s paper. J. Roy. Statist. Soc. Ser. B 28 294.

CAMPBELL, M. J. and WALKER, A. M. (1977). A survey of statistical work on the Mackenzie River
series of annual Canadian lynx trappings for the years 1821-1934 and a new analysis. J.
Roy. Statist. Soc. Ser. A 140 411-431.

HARVEY, A. C. and MCKENZIE, C. R. (1984). Missing observations in dynamic econometric models: a
partial synthesis. In Time Series Analysis of Irregularly Observed Data. Lecture Notes in
Statist. (E. Parzen, ed.) 25 108-133. Springer, New York.

JONES, R. H. (1980). Maximum likelihood fitting of ARMA models to time series with missing
observations. Technometrics 22 389-395.

JONES, R. H. (1984). Fitting multivariate models to unequally spaced data. In Time Series Analysis
of Irregularly Observed Data. Lecture Notes in Statist. (E. Parzen, ed.) 25 158-188.
Springer, New York.

SHUMWAY, R. H. (1984). Some applications of the EM algorithm to analyzing incomplete time series
data. In Time Series Analysis of Irregularly Observed Data. Lecture Notes in Statist.
(E. Parzen, ed.) 25 290-324. Springer, New York.

TONG, H. (1977). Some comments on the Canadian lynx data. J. Roy. Statist. Soc. Ser. A 140
432-436.

TONG, H. (1983). Threshold Models in Non-linear Time Series Analysis. Lecture Notes in Statist. 21.
Springer, New York.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF CALIFORNIA 
BERKELEY, CALIFORNIA 94720 


J. FRANKE AND E. J. HANNAN 
University of Frankfurt and Australian National University 


This paper by Martin and Yohai will stimulate much future research. The 
authors are to be congratulated for that and for the presentation of the paper, 
which stresses statistical intuition and avoids technical detail, where possible. 

The central point of their approach to the generalisation of the concept of 
influence function to a time series setting is the explicit dependence of that 
function on the arc along which the measure of the observed process approaches 
that of the nominal process. This emphasis on a specific model for the contamina- 
tion is necessary because of the great range of possibilities for contamination in a 
time series setting. However, one can, consequently, ask how strongly the conclu- 
sions with respect to robustness and relative performance drawn from this 
influence function depend on the contamination model. The model (2.2) is very 
general but the major part of the paper and the examples in Section 5, in 
particular, deal only with $z_i^\gamma$ given by (2.4). Consider, for example,

(1) $y_i = x_i + u_i, \qquad u_i = \sum_{j=0}^{k-1} \beta_j \varepsilon_{i-j},$


where the $\varepsilon_i$ are i.i.d. with distribution $(1 - p)\delta_0 + pH$. Here the contamination
is generated by impulses which excite a linear system whose effect is imposed on 
the nominal process. The model (1) is included in the general model (2.2) and 
both (1) and (2.4) could generate similar patterns of outlier patches so that it 
would be difficult to distinguish between them from the data. Of course it can be 
hoped that conclusions from the influence function, based on (2.4), will not differ 
substantially from those that would have been derived via (1), for example, since 
essential aspects of the influence function, such as gross error sensitivities, are 
essentially qualitative in nature and small numerical differences will be of no 
consequence. However, basically different types of outlier, e.g., isolated outliers 
compared to those occurring in patches, appear to lead to large differences, as
Figures 1 and 2 of Martin and Yohai show, and this is an agreeable feature of 
their approach. 
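
To make the impulse-driven contamination model (1) concrete, here is a minimal simulation sketch. It assumes the reading of (1) in which impulses, zero with probability 1 - p and drawn from H otherwise, are passed through a finite linear filter and added to the nominal series; the function name `contaminate` and its arguments are hypothetical.

```python
import random


def contaminate(x, beta, p, draw_h, rng):
    """Return y_i = x_i + u_i with u_i = sum_j beta[j] * eps[i - j].

    eps[i] is 0 with probability 1 - p and a draw from H (via draw_h) otherwise.
    Impulses before the start of the series are taken to be zero.
    """
    n = len(x)
    eps = [draw_h(rng) if rng.random() < p else 0.0 for _ in range(n)]
    y = []
    for i in range(n):
        u = sum(b * eps[i - j] for j, b in enumerate(beta) if i - j >= 0)
        y.append(x[i] + u)
    return y
```

With a filter of length k each impulse produces a patch of up to k disturbed values, which is why (1) and the patch model (2.4) can be hard to tell apart from data.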

The main example of the paper, for practice, is the robust estimation of 
ARMA parameters. In considering various estimation procedures via the in- 
fluence function it is necessary to consider the purpose of the analysis. Three, not 
entirely disjoint, purposes appear to have been as follows. (1) The use of an 
ARMA model to approximate to the structure of a series so as to obtain 
understanding. (2) The use of an ARMA model to construct a spectral estimate. 
(3) The use of an ARMA model for prediction and control. It is not obvious that 
estimates of ARMA parameters that are robust lead to spectral estimates with 
good properties. A spectral density influence function is needed to judge this. The 
same kind of thing can be said of control and prediction and again it would be 
interesting to transfer the concept of influence function to the control setting. It 
seems that the main purpose of influence function analysis must be the first of 
the three listed above. 

If one fits an ARMA model with this first objective in mind one has to be 
aware of problems of misspecification, which have been only recently examined. 
(See, for example, Hannan (1982) and Shibata (1980).) There is a fair understand- 
ing of what happens when the model order is chosen too small or too large or, 
more realistically, the truth does not lie in any of the parametric sets examined, 
even when the order is left to be determined from the data. For robust ARMA 
procedures one has to face the question whether the influence function, derived 
on the basis of a given parametric model being specified, can be usefully 
interpreted in the realistic setting. Again the extension of the idea of the 
influence function to a more general setting, where the ARMA model is no more 
than an approximation to the truth, is desirable. 

Finally we discuss Section 8.2. Examples of the effects of a few isolated 
outliers on spectral estimation have been given in Bloomfield (1976, Section 5.3), 
Kleiner, Martin, and Thomson (1979), and Tukey (1984, Chapter 29). Having in
mind the intuitive understanding these give it seems unsatisfactory that the 
spectral density influence curve IC,(£), derived on the basis of an additive outlier 
model, does not depend on frequency. Small aberrations at high frequencies with 
low power may be as important as larger effects at frequencies with large power. 
The performance of the spectral estimator might better be measured in terms of 
relative mean square error. See for example Priestley (1981, Section 7.2). One 
might better examine the influence function of log S,( f) as an estimate of 
log S(f ). 


REFERENCES 


BLOOMFIELD, P. (1976). Fourier Analysis of Time Series: An Introduction. Wiley, New York.

HANNAN, E. J. (1982). Testing for autocorrelation and Akaike’s criterion. In Essays in Statistical
Science (J. Gani and E. J. Hannan, eds.) 403-412. Applied Probability Trust, Sheffield.

KLEINER, B., MARTIN, R. D. and THOMSON, D. J. (1979). Robust estimation of power spectra.
J. Roy. Statist. Soc. Ser. B 41 313-351.

PRIESTLEY, M. B. (1981). Spectral Analysis and Time Series. Academic, New York.

SHIBATA, R. (1980). Asymptotically efficient selection of the order of the model for estimating
parameters of a linear process. Ann. Statist. 8 147-164.

TUKEY, J. W. (1984). The Collected Works of John W. Tukey 2 (D. R. Brillinger, ed.). Wadsworth,
Monterey, Calif.


UNIVERSITY OF FRANKFURT
DEPARTMENT OF MATHEMATICS
JOHANN WOLFGANG GOETHE UNIVERSITY
P.O. BOX 111319, 6000 FRANKFURT
WEST GERMANY

AUSTRALIAN NATIONAL UNIVERSITY
DEPARTMENT OF STATISTICS
MATHEMATICS BUILDING, I.A.S.
AUSTRALIAN NATIONAL UNIVERSITY
G.P.O. BOX 4, CANBERRA, 2601
AUSTRALIA


HANS R. KÜNSCH
ETH, Zurich 


Martin and Yohai provide an interesting study on the effect of atypical 
observations on the behavior of estimators in time series. The influence func- 
tional given by Definition 4.2 is the infinitesimal asymptotic bias in a one-param- 
eter family of contaminations of a given model. The bias was also the starting 
point of my own paper (1984, cf. Section 1.2), but I treated only a smaller class of 
estimators and I focused on different aspects. So let me explain the differences 
between the two approaches and discuss their advantages and disadvantages. 

Heuristically the connection between ICH and IF is as follows. ICH is the 
derivative in all directions, i.e., the gradient of T. Hence by the chain rule of 
differential calculus one gets, formally, 


$$\frac{d}{d\gamma} T(\mu_y^\gamma) = \Big\langle \operatorname{grad} T, \frac{d}{d\gamma}\mu_y^\gamma \Big\rangle = \int \mathrm{ICH}(y)\, \frac{d}{d\gamma}\mu_y^\gamma(dy).$$

If T depends only on the m-dimensional marginal, we can find $(d/d\gamma)\mu_y^\gamma$ in the
model (2.4) by the following argument. Ignoring terms of order $o(\gamma)$, there is at
most one block of outliers intersecting with $(1, 0, \ldots, 2 - m)$, and the initial point
of this block is distributed uniformly over $(1, 0, \ldots, 3 - m - k)$. To me, the most
important theoretical contribution of Martin and Yohai is Theorem 4.2 where
they show that the same result also holds for $m = \infty$, at least if $\psi$ depends only
weakly on values far away. Since the uniform distribution on all integers is not
finite, a bounded $\psi$ is not sufficient for the boundedness of $(d/d\gamma)T(\mu_y^\gamma)$.

Some of the arguments in the proof of Theorem 4.2 involve the specific 
contamination model while others are valid more generally. Since the latter may 
be useful in other situations, I propose to split it in the following way. 


THEOREM 4.1’. Let T be a $\psi$ estimate with $t_0 = T(\mu_x)$ and put $m(\gamma, t) =
E[\psi(y_1^\gamma, y_0^\gamma, \ldots; t)]$. If
(a’) $T(\mu_y^\gamma) - t_0 = O(\gamma)$;
(b’) $m(0, t)$ is differentiable at $t = t_0$ and the derivative C is nonsingular;
(c’) $b(t) = \lim_{\gamma \to 0} (m(\gamma, t) - m(0, t))/\gamma$ exists and the convergence is uniform for
$|t - t_0| \le \varepsilon_0$,

then $\lim_{\gamma \to 0} (T(\mu_y^\gamma) - t_0)/\gamma$ exists and is equal to

$$-C^{-1} b(t_0) = \lim_{\gamma \to 0} E[\mathrm{ICH}(y^\gamma)]/\gamma.$$

PROOF (extracted from Martin and Yohai). First note that because of (b’)
and (c’) b is continuous at $t_0$. Using $m(\gamma, T(\mu_y^\gamma)) = 0$ we thus obtain

$$\begin{aligned}
(m(0, T(\mu_y^\gamma)) - m(0, t_0))/\gamma &= -(m(\gamma, T(\mu_y^\gamma)) - m(0, T(\mu_y^\gamma)))/\gamma \\
&= -b(t_0) + b(t_0) - b(T(\mu_y^\gamma)) + o(1) \\
&= -b(t_0) + o(1).
\end{aligned}$$

On the other hand by (b’) and (a’)

$$(m(0, T(\mu_y^\gamma)) - m(0, t_0))/\gamma = C[T(\mu_y^\gamma) - t_0]/\gamma + o(1). \qquad \Box$$


In the situation of Theorem 4.2, the condition (c’) above follows from (9.10), 
(9.13), and (9.4)-(9.6). 
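
As a reading aid, the two displays in the proof can be equated to make the final step explicit. The sketch below assumes the symbols of the theorem ($m$, $b$, $C$, $t_0$, and the contaminated measure written here as $\mu_y^\gamma$) and uses the nonsingularity of C from (b’).

```latex
\begin{align*}
C\,\frac{T(\mu_y^\gamma) - t_0}{\gamma} + o(1)
  &= \frac{m(0, T(\mu_y^\gamma)) - m(0, t_0)}{\gamma}
   = -b(t_0) + o(1), \\
\lim_{\gamma \to 0} \frac{T(\mu_y^\gamma) - t_0}{\gamma}
  &= -C^{-1} b(t_0).
\end{align*}
```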


A limitation of the results obtained by Martin and Yohai lies in the special 
class of contamination models considered. The general replacement model (2.1) 
does not contain innovation outliers, and some of the results from Sections 5 and 
6 do not generalize from AO’s to other types of outliers. For instance if we
calculate the GES for pure replacement outliers instead of AO’s we get the
following values for the HKW estimators in the AR(1) model at $\phi = 0.5$ (cf.
Table 6.1): 


k=1 k = 20 
GM-H —2.5 2.3 
GM-BS -17 2.6 


So it is not always true that the bisquare is to be preferred to the Huber function. 
Even in the case of location estimators with i.i.d. observations, the AO model has 
some peculiar features. If we choose the constants such that both estimators have 
95% efficiency at the Gaussian, the GES for Huber is 1.63 and for bisquare 1.35. 

For real data it is impossible to say if one has additive or pure replacement 
outliers or if there is some dependence between the $(x_i)$, $(w_i)$, and $(z_i)$. Because of
this and because one time series often contains different types of outliers, the 
class P in Definition 6.1 of the GES has to be chosen quite large. The authors do 
not discuss this point, but I think that one should take at least all joint 
distributions of $(x_i, w_i)$ with marginal $\mu_x$ and possibly also all block lengths k in
the model (2.4) for the $(z_i)$. With such a class Hampel’s optimality problem
becomes intractable, but one possibly will have to live with this situation. 
Optimality in such small classes as considered in Section 7 is not very helpful in 
my opinion. 

A second limitation comes from the fact that, in contrast to the i.i.d.
case, the infinitesimal asymptotic bias does not coincide with the standardized
influence of one outlier in a long time series. In order to make this clearer, let us
consider a GM estimator for the AR(1) model as an example. If $(x_1, x_2, \ldots, x_n)$
is a sample from the clean process and $1 < i < n$ we obtain (cf. (1.27) of
Künsch (1984) and (5.10)):


Deas tut Dyes eit gy TRS gee) 


cra tn 


= nB\(1 a e) (alz, = PX} + v, x, (1 = $) 
(x41 — $x, — ov, (x, + o)(1 ~ 7”) 
a(x, a $X, 1s x, (1 z #)'^3) = EA mo px,, x,(1 ~ ey). 


If we take averages, only the second term remains in accordance with (5.6). But 
that does not mean that the first term is small or even zero. In the HKW case it 
takes any value between inf and sup 7. If we only look at the infinitesimal 
asymptotic bias, we thus underestimate the possible effect of one outlier. These 
considerations led me to take sup|ICH| as the sensitivity measure. This gives an 
upper bound for the above expression although it may be too pessimistic and gets 
worse as the range of $\psi$ increases.

Finally let me point out an important open problem: How can we assess the 
effects of deviations from the assumption of stationarity? The difficulty is that 
with most models the effect of the nonstationarity becomes either negligible or 
dominant as n increases. One possibility would be to rescale the nonstationarity 
with each n in order to get reasonable asymptotics. In the case of a trend this 
would mean to consider $y_1 = x_1 + f(1/n)$, $y_2 = x_2 + f(2/n), \ldots, y_n = x_n + f(1)$.
At least if $\psi$ has finite range, I can show that the asymptotic value is the solution
of


$$\int\!\!\int \psi(x_1 + f(u), x_0 + f(u), \ldots, T)\, d\mu(x)\, du = 0$$


and the present techniques could be applied. However, in this way we have 
returned to the land of stationarity, and I wonder if there is a better approach. 

In summary, this is an important contribution, particularly because it covers a 
very general class of estimators in time series. However, I have some reservations 
on the use of the AO model, and I do not think that the infinitesimal asymptotic 
bias alone captures the main effects of outliers in large, but finite samples, since it 
involves too much averaging. 


REFERENCE 
KÜNSCH, H. (1984). Infinitesimal robustness for autoregressive processes. Ann. Statist. 12 843-863.


SEMINAR FÜR STATISTIK
ETH-ZENTRUM
CH-8092 ZURICH 
SWITZERLAND 


ROBERT B. MILLER AND JAE JUNE LEE
University of Wisconsin, Madison 


In this paper the authors have made a convincing case for the need to modify 
Hampel’s definition of influence curve before using it in time series analysis. The 
basic intuition is simply stated. Time series have “memory,” so a definition using 
the concept of the influence of observations one-at-a-time must be inadequate. It 
is helpful to have this intuition reinforced by the analysis provided by the 
authors. 

We like the model form $y_i^\gamma = (1 - z_i^\gamma)x_i + z_i^\gamma w_i$ as a generalization of the
usual contamination model, as it appears to us to be a more realistic model for 
outliers. Concerning the technical aspects of the paper we have several questions, 
however. In their general replacement model (2.1) the authors require the 
contaminated process to be stationary and ergodic. Is it not furthermore neces- 
sary to require joint stationarity of the (x, w, z) process when the components 
are dependent? 

While in some settings it is possible to consider estimates of the form 
$T_n = T(F_n)$ (see Huber (1981) and Künsch (1984)), the authors require a more
sophisticated definition which defines T as a limit of sequences of functionals $T_n$
(see Hampel (1971)). They seem to require very weak conditions on the $T_n$’s, but
we wonder if the stronger one of equicontinuity may be needed. Without this
condition, how can we be sure that for some fixed n, $T_n(X_1, \ldots, X_n; F)$ is not
very far from $T(\mu_F) = \theta$, even for large n? We also wonder if more attention
needs to be paid to the domain of definition of T. Suppose, for example, we take 
the domain to be the space of stationary and ergodic processes. Then we note 
that the IC is derived from the ICH, which is defined for some measures that are 
not stationary and ergodic. 

Finally, we note that the IC defined by the authors is process dependent but 
not data dependent because the IC is essentially obtained by “expecting” the 
data out of the ICH (cf. Hampel (1974) and Künsch (1984)). Thus this IC is 
appropriate for studying questions of “gross error sensitivity” but not questions 
of a pointwise nature. Gross error sensitivity is probably not a sufficient basis for 
evaluating robustness, so we feel additional criteria will need to be introduced to 
complete treatment of robustness in time series. 

We now wish to raise concerns of a practical nature. We write quite frankly 
wondering how important psi functions and influence curves will prove to be in 
time series modelling. As the authors note, experienced time series analysts are 
quite familiar with outliers, and perhaps it should not go without saying that 
these analysts have some pretty good ideas on what to do about them. The time 
indexing and the memory that make the theoretical treatment of outliers difficult 
provide some resources to guide the practical handling of outliers. Moreover, in 
practice, we have recourse to much richer models than those contemplated in the 
paper under discussion. 

As examples, we refer to pages 67—70 of Jenkins (1979) and to Miller (1986). In 
the first reference, a change in policy creates an “isolated” outlying observation 
followed by a gradual return to equilibrium. This effect is evident in the residuals
and is modelled by intervention analysis. (See Box and Tiao (1975).) In the 
second reference, a bivariate time series model of two fertility measures is 
constructed. The data analysis reveals the years of World War II as a “patch” of 
outlying observations. Intervention terms are added to the model to handle them, 
and the parameter estimates show that their amplitude is not constant. The main 
point is that outliers often have “assignable causes,” and if so they can be 
incorporated into the model, rather than downweighted. A secondary point is 
that outliers that do not have assignable causes may deserve to have full weight, 
as their downweighting may result in the underestimation of the variance of 
future observation. Jenkins (1979), in fact, handles a second outlier in his series 
by doing nothing to it because he can find no cause for it. 

We are very concerned about our ability to recognize data that have been 
generated from the models presented by the authors. We know, for example, that 
an AO model with a core AR(1) and contaminating white noise is, theoretically, 
an ARMA(1, 1) process. Yet there is evidence that


(a) this situation is very difficult, if not impossible, to identify from the usual 
data analyses involving correlation functions or spectra; and 

(b) even when we try to fit an ARMA(1, 1) model to the data using nonlinear 
least squares, the parameter estimates have very unattractive properties. (See 
Miller (1980).) 
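
The equivalence mentioned above, that an AR(1) core observed with additive white noise is theoretically an ARMA(1, 1) process, can be checked numerically by matching autocovariances. The sketch below is an illustration only; it assumes the parametrization $y_t - \phi y_{t-1} = a_t - \theta a_{t-1}$, and the function names are hypothetical.

```python
import math


def ao_ar1_as_arma11(phi, var_e, var_w):
    """Return (theta, var_a) of the ARMA(1,1) form of AR(1) plus white noise.

    (1 - phi B) y_t = e_t + w_t - phi w_{t-1} is an MA(1); theta is the
    invertible root obtained by matching its lag-0 and lag-1 autocovariances.
    """
    if phi == 0.0 or var_w == 0.0:
        return 0.0, var_e + var_w
    g0 = var_e + (1.0 + phi * phi) * var_w   # lag-0 autocovariance of the MA(1)
    g1 = -phi * var_w                        # lag-1 autocovariance of the MA(1)
    r = -g0 / g1                             # equals (1 + theta^2) / theta
    theta = (r - math.copysign(math.sqrt(r * r - 4.0), r)) / 2.0
    return theta, -g1 / theta


def arma11_autocov(phi, theta, var_a, lag):
    """Autocovariance of y_t - phi*y_{t-1} = a_t - theta*a_{t-1}."""
    if lag == 0:
        return var_a * (1.0 + theta * theta - 2.0 * phi * theta) / (1.0 - phi * phi)
    g1 = var_a * (1.0 - phi * theta) * (phi - theta) / (1.0 - phi * phi)
    return g1 * phi ** (lag - 1)


def ao_autocov(phi, var_e, var_w, lag):
    """Autocovariance of y_t = x_t + w_t with x_t AR(1) and w_t white noise."""
    gx = var_e * phi ** lag / (1.0 - phi * phi)
    return gx + (var_w if lag == 0 else 0.0)
```

When the noise variance is small relative to the core variance, the implied theta is close to zero, which is one way to see why the ARMA(1, 1) structure is hard to identify from correlation functions.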


The authors have shown that once a contamination model is properly iden- 
tified, their estimation techniques are attractive. Do they have any suggestions 
for model identification? If not, do they have any notion of the effect of model 
misspecification on their estimation techniques? While confrontation of these 
questions may not be appropriate in a paper on influence curves for time series, 
we feel we need answers to them before the authors’ work can be applied. 

In closing we wish to express the spirit in which we hope research on robust 
methods in time series will be done. Time series are usually analyzed in an 
environment in which collateral information is available, whether it be knowledge
of economic or social upheavals, of policy changes, or of malfunctioning
equipment that monitors a stream or a machine. Moreover, time series analysts 
are trained to recognize patterns in correlation functions, spectra, and sequence 
plots that suggest fundamental modifications of simple, basic models, such as 
ARMA models. If robust techniques can help us do these things better, then they 
will be welcome additions to theory and practice. If robust techniques can only 
promise estimates of parameters of “core processes” without due regard for other 
events that impact the series of interest, and without suggestions for model 
identification, then we fear they will be of limited practical use. 


REFERENCES 


Box, G. E. P. and Tiao, G. C. (1975). Intervention analysis with applications to economic and 
environmental problems. J. Amer. Statist. Assoc. 70 70-79.

HAMPEL, F. R. (1971). A general qualitative definition of robustness. Ann. Math. Statist. 42 
1887-1896. 


HAMPEL, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 
69 383-393. 

HUBER, P. J. (1981). Robust Statistics. Wiley, New York. 

JENKINS, G. M. (1979). Practical Experiences with Modelling and Forecasting Time Series. Gwilym
Jenkins and Partners (Overseas) Ltd., Jersey, Channel Islands.

KÜNSCH, H. (1984). Infinitesimal robustness for autoregressive processes. Ann. Statist. 12 843-863.

MILLER, R. B. (1980). Discussion of “Robust estimation for time series” by R. Douglas Martin. In
Directions in Time Series (D. R. Brillinger and G. C. Tiao, eds.) 255-262. IMS, Hayward,
Calif.

MILLER, R. B. (1986). A bivariate model for total fertility rate and mean age of childbearing. 
Insurance: Mathematics and Economics. To appear. 


UNIVERSITY OF WISCONSIN 
1155 OBSERVATORY DRIVE 
MADISON, WISCONSIN 53706


H. VINCENT POOR
University of Illinois at Urbana-Champaign 


1. General remarks. I would like to begin by saying that I enjoyed this 
paper very much. As with their previous works, both individual and joint, 
Martin and Yohai have achieved in this paper a nice combination of analytical 
rigor and practical significance driven by clearly presented intuition. I congratu- 
late the authors on this contribution. 

Despite their central role in many areas of robust statistics, the traditional 
influence curves proposed by Hampel have played a somewhat limited role in the 
study of robustness properties of statistical signal processing procedures for 
applications such as communications and control, primarily because of the 
restriction of their applicability to static models. Other approaches, such as 
minimax robustness, have proven to be much more useful in this context (see, for 
example, the recent surveys by Kassam and Poor (1985) and Poor (1986)). 
However, by allowing for the treatment of dynamic models, the notion of 
influence functionals as proposed by Martin and Yohai eliminates this principal 
disadvantage. The introduction of a heuristic tool of this type is thus a major 
advance from the viewpoint of robust statistical signal processing, and I can 
foresee a wide range of applications of Martin and Yohai’s ideas in this area. 


2. System identification. System identification is among the many appli- 
cations that can be examined in the context of the Martin-Yohai influence 
functional. For example, consider the simple problem of identifying a first-order 
time-invariant linear system from measurements of inputs and noisy outputs. 
This problem corresponds to the model 
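(1) $s_i = \theta s_{i-1} + u_i, \qquad q_i = s_i + n_i, \qquad i \in \mathbb{Z},$


in which we assume that $\{u_i\}_{i \in \mathbb{Z}}$ and $\{n_i\}_{i \in \mathbb{Z}}$ are independent i.i.d. (0, 1)
sequences and $|\theta| < 1$. The nominal observation process $\{x_i\}_{i \in \mathbb{Z}}$ consists
of the inputs and noisy outputs (i.e., $x_i = (u_i, q_i)^T$), and so we can think of actual
observations $y_i = (v_i, r_i)^T$, where $\{v_i\}_{i \in \mathbb{Z}}$ and $\{r_i\}_{i \in \mathbb{Z}}$ are generated by replacement
models from $\{u_i\}_{i \in \mathbb{Z}}$ and $\{q_i\}_{i \in \mathbb{Z}}$, respectively.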

M-estimates of $\theta$ in (1) (see, for example, Poljak and Tsypkin (1980)) are of
the form


(2) $T_n \in \arg\min_{|t| < 1} \sum_{i=1}^{n} \rho\Big(r_i - \sum_{m \le i} t^{i-m} v_m\Big)$


for appropriate functions $\rho$. The estimates of (2) have limit $\psi$ function given by


(3) $\tilde\psi(y; t) = \psi\Big[r_1 - \sum_{m=-\infty}^{1} t^{1-m} v_m\Big] \sum_{m=-\infty}^{1} (m - 1)\, t^{-m} v_m,$


where $\psi = \rho'$.

Within regularity on $\psi$, the influence functional of (3) for patchy outliers can
be evaluated via Martin and Yohai’s Theorem 4.2. Upon examination of IF in
this case one sees immediately that, for constant outlier level $\zeta$, the least-squares
estimate of $\theta$ in (1) is linearly unbounded in $\zeta$ for outliers in the output
observations and is quadratically unbounded in $\zeta$ for outliers in the observations
of the input. Also, although the usual robust functions yield bounded influence
against output outliers, we see that any nondecreasing $\psi$ is at least linearly
unbounded in $\zeta$ for input-observation outliers. Thus, from the viewpoint of gross
error sensitivity, redescending $\psi$ functions are called for in this model. (Alternatively,
$\psi(\varepsilon_1)\xi$ could be replaced by a bounded $\eta(\varepsilon_1, \xi)$ as in the directly
observed AR case discussed in the paper.)

The general trend of the influence of patch length on M estimation in this 
model is not as obvious as that for the influence of outlier amplitude. For the 
particular case of least-squares estimation with constant-level outliers, analysis 
of (3) via Theorem 4.2 shows that patch length is irrelevant for output outliers. 
However, the influence of input-observation outliers on least-squares is 
O(6~?*/k) where k is the average patch length. Thus, there is clearly a need to
consider permutation dependent issues when analyzing robustness in such 
models. 


3. Time-varying models. Although it has nothing particularly to do with 
the consideration of influence functionals versus influence curves, another useful 
aspect of Martin and Yohai’s formulation is the idea of analyzing estimates via 
the limiting form $T(\mu)$ regardless of whether or not $T_n$ is actually given by $T(\mu_n)$
for the empirical measure $\mu_n$. This idea allows the analysis of some time-varying
models of interest. For example, consider the problem of estimating the ampli- 
tude of a signal of known form from noisy observations, 


(4) $x_i = \theta s_i + n_i, \qquad i = 1, 2, \ldots,$


where $\{n_i\}_{i=1}^{\infty}$ is i.i.d. and $\{s_i\}_{i=1}^{\infty}$ is a known sequence. M-estimates of $\theta$ based on
corrupted observations $\{y_i\}_{i=1}^{\infty}$ are of the form


(5) $T_n \in \arg\min_{t \in \mathbb{R}} \sum_{i=1}^{n} \rho(y_i - t s_i),$

which can be analyzed via

(6) $\tilde\psi(y; t) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} s_i\, \psi(y_i - t s_i)$


(with $\psi = \rho'$), assuming this limit exists in an appropriate sense. Note that, as in
the i.i.d. case, $T(\mu)$ (i.e., the solution of $\int \tilde\psi(y; t)\, \mu(dy) = 0$) is a function of only
the marginal distribution of y in this situation.
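
A finite-sample version of the amplitude M-estimate in (5) can be sketched as follows. The choices here are assumptions made for illustration, not part of the discussion: the Huber function with tuning constant 1.345 stands in for $\psi$, and the estimating equation $\sum_i s_i \psi(y_i - t s_i) = 0$ is solved by bisection, which works because the left side is nonincreasing in t for a nondecreasing $\psi$. The names `huber_psi` and `m_estimate_amplitude` are hypothetical.

```python
def huber_psi(r, c=1.345):
    """Huber psi function: identity on [-c, c], clipped outside."""
    return max(-c, min(c, r))


def m_estimate_amplitude(y, s, psi=huber_psi, iters=100):
    """Solve sum_i s_i * psi(y_i - t * s_i) = 0 for t by bisection.

    The ratios y_i / s_i bracket the root: at the smallest ratio every term
    of the score is nonnegative, at the largest every term is nonpositive.
    """
    ratios = [yi / si for yi, si in zip(y, s) if si != 0.0]
    lo, hi = min(ratios), max(ratios)

    def score(t):
        return sum(si * psi(yi - t * si) for yi, si in zip(y, s))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid      # score still positive: root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With one grossly corrupted observation the least-squares ratio $\sum s_i y_i / \sum s_i^2$ can move arbitrarily far, whereas the bounded $\psi$ limits the shift in the estimate.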


4, Long-term serial dependencies. As a final comment, I mention another 
type of statistical contamination that would be interesting to examine from the 
viewpoint of influence functionals. In particular, the influence on time-series 
procedures of long-term serial dependencies such as those present in electrical 
systems due to so-called fractional or “1/f ” noises (see also Graf et al. (1984)) 
might be studied in this context. For example, one might consider the influence 
functionals of parameter estimates along a measurement-error-model trajectory 
$\{\mu_\gamma;\ 0 \le \gamma < 1\}$ where $\mu_\gamma$ is a Gaussian measure with zero mean and autocovariance
$\int y_i y_{i+k}\, \mu_\gamma(dy) = \tfrac{1}{2}\big[|k + 1|^{\gamma+1} + |k - 1|^{\gamma+1} - 2|k|^{\gamma+1}\big]$. A process described
by $\mu_\gamma$ would arise, for example, from the increments of a fractional Brownian
motion (Mandelbrot and Van Ness (1968)) with self-similarity parameter $H =
(\gamma + 1)/2$. The tail behavior of the spectrum of this process is $O(f^{-\gamma})$, with
$\gamma = 0$ yielding white noise. Alternatively, one might consider a mixed error
process of the form $(1 - \gamma)e_i + \gamma w_i$ with $\{e_i\}$ white and $\{w_i\}$ the increments of a
fixed fractional Brownian motion. Examination of the local behavior of time-series
procedures at $\gamma = 0$ in either of these models would give an indication of the
tolerance of such procedures to unexpected long-term dependencies. 
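
The autocovariance of the fractional-noise trajectory is easy to tabulate. The sketch below assumes the standard fractional Gaussian noise form with exponent $\gamma + 1 = 2H$ and unit variance; the function name `fgn_autocov` is hypothetical.

```python
def fgn_autocov(k, gamma):
    """Lag-k autocovariance of unit-variance fractional Gaussian noise.

    Uses the exponent gamma + 1 = 2H, so gamma = 0 corresponds to H = 1/2.
    """
    e = gamma + 1.0
    return 0.5 * (abs(k + 1) ** e + abs(k - 1) ** e - 2.0 * abs(k) ** e)
```

At gamma = 0 the function returns 1 at lag 0 and 0 at every other lag, which is white noise; for gamma between 0 and 1 the autocovariances are positive and decay slowly, which is the long-term dependence in question.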


REFERENCES 


GRAF, H., HAMPEL, F. R. and TACIER, J.-D. (1984). The problem of unsuspected serial correlations.
In Robust and Nonlinear Time Series Analysis. Lecture Notes in Statist. (J. Franke, W.
Härdle and D. Martin, eds.) 26 127-145. Springer, New York.

KASSAM, S. A. and POOR, H. V. (1985). Robust techniques for signal processing: A survey. Proc.
IEEE 73 433-481.

MANDELBROT, B. B. and VAN NESS, J. W. (1968). Fractional Brownian motions, fractional noises and
applications. SIAM Rev. 10 422-437.

POLJAK, B. T. and TSYPKIN, YA. Z. (1980). Robust identification. Automatica 16 53-63.

POOR, H. V. (1986). Robustness in signal detection. In Communications and Networks: A Survey of
Recent Advances (I. F. Blake and H. V. Poor, eds.) 131-156. Springer, New York.


COORDINATED SCIENCE LABORATORY 
UNIVERSITY OF ILLINOIS 

1101 WEST SPRINGFIELD AVENUE 
URBANA, ILLINOIS 61801 


832 DISCUSSION 


P. M. ROBINSON 
London School of Economics 


Martin and Yohai’s paper is a fine technical achievement, developing an 
interesting tool, the influence functional, for describing an aspect of time series 
behaviour, and continuing the authors’ work on the difficult and important 
problem of time series analysis in the presence of outliers. I have two points, one 
being a suggestion prompted by their discussion of hypothesis testing, and 
motivated by the need for test statistics with both good robustness and good 
power properties against given alternatives. My first and main point concerns 
Martin and Yohai’s approach towards dealing with the outlier behaviour de- 
scribed by their general replacement model, and to some extent this impacts on 
the use of their influence functional. 

Martin and Yohai’s general replacement model (2.2) is indeed “general,” and 
even in the pure replacement (PR) and additive outliers (AO) special cases it 
presents an identifiability problem to which GM and RA rules need not 
necessarily provide a useful solution. The non-Gaussian character of y and the 
nonlinear character of the GM and RA rules severely hinder a proper analysis of 
the identifiability problem. While Martin and Yohai’s results embrace w and v 
with no moments, even bad contamination can be modelled by w and v with 
finite variance, in which case, if their core x process is indeed “usually Gaussian,” 
a second moment analysis may give some insight into the identifiability problem 
in the LS case, and conceivably also into the possible impact of GM and RA 
estimators on the problem. Denoting means by $m_x$, etc., and lag-$j$ 
autocovariances by $c_x(j)$, etc., for the PR model 

(1)  $m_y = m_x + (m_w - m_x)\,m_z,$ 

(2)  $c_y(j) = (1 - m_z)^2 c_x(j) + m_z^2 c_w(j) + \{(m_w - m_x)^2 + c_x(j) + c_w(j)\}\,c_z(j).$ 

For the AO model with v independent of x (as assumed by Martin and Yohai in 
Section 5) 

(3)  $m_y = m_x + m_z m_v,$ 

(4)  $c_y(j) = c_x(j) + m_z^2 c_v(j) + \{m_v^2 + c_v(j)\}\,c_z(j).$ 

Note that x’s ARMA coefficients are functionals of the $c_x(j)$. 

It is easily seen that the $c_y(j)$ can be quite unrecognisable from the $c_x(j)$, 
leading in general not only to inconsistent estimation of x’s ARMA coefficients 
but also to incorrect order determination via criteria such as AIC. Can 
robustification alleviate these problems? Note that $c_y$ is determined not only by 
$c_x$ and $c_w$, or $c_v$, and the frequency of contamination $m_z$, but also by $c_z$. We 
cannot choose $m_w = m_x$ or $m_v = 0$ (thereby eliminating $(m_w - m_x)m_z$, 
$(m_w - m_x)^2 c_z(j)$, $m_z m_v$ and $m_v^2 c_z(j)$ from (1)-(4), respectively) without loss of 
generality because without further information only y can be mean-corrected; 
substantially different $m_x$ and $m_w$, or nonzero $m_v$ (by no means unlikely, it 
seems) could conceivably result in y reflecting z’s autocovariance structure more 
than x’s, or w’s or v’s for that matter. By down-weighting extreme y’s, GM 
estimation (e.g., the Mallows variant, rather than M estimation) seems to take a 
step in the right direction, albeit in an ad hoc fashion, and there is evidence (e.g., 
Martin and Jong (1977)) that it does help; it’s not clear to me whether the RA or 
GRA rules are effective in reducing the inconsistency, notwithstanding Section 7 
and their evident relevance to innovations outlier (IO) problems. 
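The second-moment formulas (2) and (4) can be checked numerically. The sketch below is a hedged illustration in plain Python: the function names and the particular PR setup (Gaussian AR(1) core with unit innovation variance, white Gaussian w, i.i.d. Bernoulli z, all mutually independent) are assumptions chosen for the example, not part of the discussion.

```python
import random


def pr_autocov(cx, cw, cz, mx, mw, mz):
    """Formula (2): lag-j autocovariance of y under pure replacement."""
    return (1 - mz) ** 2 * cx + mz ** 2 * cw + ((mw - mx) ** 2 + cx + cw) * cz


def ao_autocov(cx, cv, cz, mv, mz):
    """Formula (4): lag-j autocovariance of y under additive outliers."""
    return cx + mz ** 2 * cv + (mv ** 2 + cv) * cz


def sample_autocov(y, lag):
    """Sample autocovariance with divisor n."""
    n = len(y)
    mean = sum(y) / n
    return sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n)) / n


def simulate_pr(n, phi, mw, sw, p, seed=0):
    """y_t = (1 - z_t) x_t + z_t w_t with a zero-mean Gaussian AR(1) core x,
    white N(mw, sw^2) replacements w, and i.i.d. Bernoulli(p) indicators z."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0 / (1.0 - phi ** 2) ** 0.5)  # stationary start
    y = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        z = 1 if rng.random() < p else 0
        w = rng.gauss(mw, sw)
        y.append((1 - z) * x + z * w)
    return y


if __name__ == "__main__":
    phi, mw, sw, p = 0.5, 3.0, 2.0, 0.1
    y = simulate_pr(200_000, phi, mw, sw, p)
    var_x = 1.0 / (1.0 - phi ** 2)
    for lag, cx, cw, cz in [(0, var_x, sw ** 2, p * (1 - p)),
                            (1, phi * var_x, 0.0, 0.0)]:
        print(lag, sample_autocov(y, lag), pr_autocov(cx, cw, cz, 0.0, mw, p))
```

In this configuration formula (2) gives $c_y(0) = 2.41$ and $c_y(1) = 0.54$, so the lag-one autocorrelation of y is about 0.22 although the core process has coefficient 0.5, which is the kind of distortion described above.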

Because ARMA autocovariance structure is closed under addition and 
multiplication, generally (2) and (4) imply that x, w, v, and z with ARMA-like 
autocovariances imply y with ARMA-like autocovariances, with ARMA orders 
at least as high as x’s. In Martin and Yohai’s “independent outliers” case, (2) 
and (4) become 

(2’)  $c_y(j) = (1 - m_z)^2 c_x(j) + m_z^2 c_w(j) + a_1 \delta_j,$ 

(4’)  $c_y(j) = c_x(j) + m_z^2 c_v(j) + a_2 \delta_j,$ 

where $a_1 > 0$, $a_2 > 0$, and $\delta_j$ is the Kronecker delta. For example, an AR(p) x 
and white noise w or v implies an ARMA(p, q) y, $q \le p$, so y’s AR order (and 
coefficients) matches x’s, but y generally has an MA component so AR(p) 
fitting to y, by GM, RA, and other rules, leads to inconsistency; an ARMA w or 
v with positive AR order implies y’s AR order generally exceeds x’s, so the 
inconsistency problem is if anything more serious. In Section 2.2 Martin and 
Yohai suggest a z process generating “patchy outliers”; it may be shown that 
this implies z has an MA(k − 1) representation, with 
$c_z(j) = (1 - p)^k\{(1 - p)^j - (1 - p)^k\}$, $0 \le j < k$. The effect may be to further 
increase y’s MA order (though not its AR order) relative to the “independent 
outliers” case. Martin and Yohai’s “patchy outliers” model is only one such, and 
to the extent that we can identify outlier occurrence in real data sets it might be 
worth investigating whether this particular model warrants emphasis. For 
example, one can model binary time series to have a more general MA structure 
than theirs, or to have AR and ARMA structure; in the latter cases y’s AR order 
will generally be increased, causing inconsistency in the GM and RA estimators 
as well as leading to different forms of the influence functional. 
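The autocovariance of the patchy 0-1 process can be verified by brute force. The sketch below assumes, as a reading of the patchy model rather than a quotation of it, that $z_i$ equals 1 whenever at least one of the k most recent i.i.d. Bernoulli(p) draws equals 1; the function names are hypothetical.

```python
from itertools import product


def patchy_cz_formula(j, k, p):
    """c_z(j) = (1-p)^k {(1-p)^j - (1-p)^k} for 0 <= j < k, and 0 for j >= k."""
    q = 1.0 - p
    j = abs(j)
    return q ** k * (q ** j - q ** k) if j < k else 0.0


def patchy_cz_enumerated(j, k, p):
    """Exact cov(z_i, z_{i+j}) by enumerating the k + j underlying draws.

    Assumed construction: z_i is the maximum of k consecutive i.i.d.
    Bernoulli(p) draws, so a single draw of 1 produces a patch of length k.
    """
    e_first = e_second = e_both = 0.0
    for bits in product((0, 1), repeat=k + j):
        prob = 1.0
        for b in bits:
            prob *= p if b else (1.0 - p)
        z_first = max(bits[:k])         # z_i uses draws 0 .. k-1
        z_second = max(bits[j:j + k])   # z_{i+j} uses draws j .. j+k-1
        e_first += prob * z_first
        e_second += prob * z_second
        e_both += prob * z_first * z_second
    return e_both - e_first * e_second


if __name__ == "__main__":
    k, p = 3, 0.2
    for j in range(k + 2):
        print(j, patchy_cz_formula(j, k, p), patchy_cz_enumerated(j, k, p))
```

The covariance vanishes for lags of k or more, which is the MA(k − 1) cut-off referred to in the text.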

I hasten to add that Martin and Yohai are well aware that AO and other 
outlier models (though not IO ones) cause inconsistency, and Martin has to some 
extent addressed the problem in earlier work. I feel that the identification 
problem should be faced up to more squarely by making a conscious effort to 
take outlier models such as Martin and Yohai’s sufficiently seriously to allow 
them to determine the form of model to be robustly estimated, via arguments 
such as mine. In case (4’) with x AR(1) and v white noise for example, we could 
fit (albeit inefficiently) an ARMA(1, 1); or we could estimate only the (correctly 
identified) AR coefficients (again inefficiently) using the following modification of 
Martin and Yohai’s (3.6): 


$\psi_i(y_1^i; \phi) = \eta\bigl(y_i - \phi y_{i-1},\ (1 - \phi^2)^{1/2} y_{i-2}\bigr), \qquad i \ge 2,$ 

where $y_{i-2}$, unlike $y_{i-1}$, is uncorrelated with the MA(1) disturbance in $y_i$ 
(though not necessarily independent of it, so non-LS estimators may still suffer 
from some inconsistency). More ambitious approaches would be to exploit 
Gaussianity of x and non-Gaussianity of outliers in a manner analogous to some 
solutions to the classical errors-in-variable problem, or to approach the non- 
Gaussian modelling problem head-on. I do not underestimate the complications 
and pitfalls in these alternatives, but I do not think that Martin and Yohai’s use 
of the same estimators in both independent and patchy outliers cases should be 
taken to imply that approximate knowledge of z’s properties (or of w’s or v’s for 
that matter) should not influence estimator choice. Their approach to estimation 
could be said to take for granted that we have almost no information about the 
character of w, v, and z. While this may more or less often be realistic, the 
authors also seem able to characterise some outlier patterns occurring in prac- 
tice, and calculation of their influence functional with a real data set in mind 
itself requires considerable knowledge of serial dependence and other distribu- 
tional structure of w, v, and z, as well as of x (though normality of x does not 
seem crucial to most of their theoretical results). Note that if we base the 
estimation rule on the “true” model for y, derived from stochastic assumptions 
on w, v, and z possibly as described above, we could still study the correspond- 
ing influence functional, more complicated though it may be; there is interest in 
the influence of outliers on consistent rules, as well as on inconsistent ones. 
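For the least squares case, the lagged-regressor idea mentioned above reduces to a ratio of sample cross-products, since in case (4’) with AR(1) x and white noise v one has $c_y(2)/c_y(1) = \phi$. The following Python sketch compares it with ordinary LS under additive outliers. It is a sketch under stated assumptions: the simulation design (unit innovation variance, zero-mean Gaussian v, i.i.d. Bernoulli z) and all function names are illustrative, and it covers only the LS special case, not a general $\eta$.

```python
import random


def simulate_ao_ar1(n, phi, p, sv, seed=0):
    """y_t = x_t + z_t v_t with a Gaussian AR(1) core x, i.i.d. Bernoulli(p)
    indicators z and white N(0, sv^2) outliers v (an assumed design)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0 / (1.0 - phi ** 2) ** 0.5)  # stationary start
    y = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, sv) if rng.random() < p else 0.0
        y.append(x + v)
    return y


def ls_estimate(y):
    """Ordinary LS: regress y_t on y_{t-1}."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


def lagged_instrument_estimate(y):
    """Solve sum (y_t - phi y_{t-1}) y_{t-2} = 0 for phi, using y_{t-2},
    which is uncorrelated with the MA(1) disturbance in y_t."""
    num = sum(y[t] * y[t - 2] for t in range(2, len(y)))
    den = sum(y[t - 1] * y[t - 2] for t in range(2, len(y)))
    return num / den


if __name__ == "__main__":
    y = simulate_ao_ar1(200_000, phi=0.5, p=0.1, sv=3.0)
    print("LS:", ls_estimate(y))
    print("lagged instrument:", lagged_instrument_estimate(y))
```

With these settings the LS estimate is pulled well below 0.5 while the lagged version stays close to it, at the cost of efficiency, as the text notes.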

Let me finally turn briefly to the question of hypothesis testing. In Section 8 
Martin and Yohai apply their influence functional to the Box-Pierce 
portmanteau statistic 

$W = \sum_{l=1}^{L} r_l^2.$ 

It is known that $W$ is asymptotically equivalent to the score test statistic 
against AR(L) or MA(L) alternatives, based on a Gaussian likelihood, and is 
thus asymptotically locally most powerful against such alternatives. Not 
surprisingly, therefore, Martin and Yohai find that $W$ is not robust. A robust 
alternative, that maintains good power properties against specified time series 
alternatives, could be obtained by applying the score principle (or Wald or 
likelihood ratio principles) to an appropriate robustified loss function. While this 
should work well in IO cases, I must echo my earlier reservations in the PR and 
AO cases; white noise x may be far from synonymous with white noise y. 
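For concreteness, a portmanteau statistic of this type can be computed as follows. This is a minimal Python sketch with hypothetical function names; it uses the common scaling by the sample size n, which is an assumption about normalization rather than something stated in the discussion.

```python
def sample_autocorr(y, lag):
    """Lag-`lag` sample autocorrelation r_lag of the series y."""
    n = len(y)
    mean = sum(y) / n
    c0 = sum((value - mean) ** 2 for value in y)
    c_lag = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    return c_lag / c0


def box_pierce(y, max_lag):
    """n times the sum of squared sample autocorrelations up to max_lag."""
    n = len(y)
    return n * sum(sample_autocorr(y, lag) ** 2 for lag in range(1, max_lag + 1))


if __name__ == "__main__":
    series = [1.0, -1.0, 1.0, -1.0]
    print(box_pierce(series, 1))
```

Because each $r_l$ is a ratio of unbounded quadratic forms in the data, a single large observation can move the statistic substantially, in line with the lack of robustness noted above.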


REFERENCE 


MARTIN, R. D. and JONG, J. (1977). Asymptotic properties of generalized M-estimates for the 
first-order autoregressive parameter. Bell Labs. Technical Memo., Murray Hill, N.J. 


DEPARTMENT OF ECONOMICS 
LONDON SCHOOL OF ECONOMICS 
HOUGHTON STREET 

LONDON WC2A 2AE 

ENGLAND 




RUEY S. TSAY 
Carnegie-Mellon University 


I am pleased to see an interesting paper on influence functionals for time 
series and would like to thank Martin and Yohai for giving me the opportunity 
to read the paper before its publication. It was more than ten years ago that Fox 
(1972) formally considered the problem of outliers (or contamination) in time 
series analysis. But only in recent years did results of rigorous investigations on 
the effects of outliers or other deviations from normality appear in the literature. 
To a large extent, this is due to the complicated dynamic structure of the time 
series process. As clearly pointed out by Martin and Yohai and by others, any 
investigation of contamination in time series is inappropriate unless it takes into 
account the time configuration. With this recognition, it is time to investigate 
rigorously the contamination problem in time series and to consider seriously its 
practical implication in applications. I hope that the publication of this paper 
will mark a new beginning for robustness in time series analysis. 

Since many discussants are experts in robustness, I shall confine my comment 
to the time series part. For simplicity, I use the same notation as Martin and 
Yohai and assume that the mean value of a time series is zero. First, the idea of 
using contamination measures $\{\mu_y^\gamma : 0 \le \gamma \le 1\}$ in defining influence functionals 
is a good one. However, from the definition (2.2), one must handle the 
contaminated process $y_t$ with care whenever $\gamma \ne 0$ because in this case the 
distributions of the “clean” and “contaminated” observations are different. Take the 
lag-one correlation coefficient $\rho$ for example. Under the stationarity assumption 
(this is the case when $\gamma = 0$), $\rho = E(y_t y_{t-1})/E(y_t^2)$, which is independent of time 
$t$. On the other hand, when $\gamma \ne 0$ the meaning of $\rho$ is time dependent, depending 
on whether $y_t$ or $y_{t-1}$ is contaminated. Consequently, further clarification is 
needed in using the general replacement model (2.2). It seems to me that the 
important assumption is the stationarity of the core process $x_t$, the 
contaminating process $w_t$ and the 0-1 process $z_t^\gamma$. Notice that this is related to my 
comment below on forecasting, which is concerned with the underlying generating 
mechanism of a time series. 

Second, from the examples shown in the paper, the influence functional is very 
much model dependent. It depends not only on the form but also on the 
parameter values of a model. In practice, neither the model nor its parameter 
values is known. They must be specified from the data. Therefore, from a 
practical point of view, one should consider the unknown model as part of the 
problem in studying the influence of contamination in time series analysis. Based 
on my limited experience, the problem of model specification is often tangled 
with the fact that contaminated data tend to show certain nonstationary 
characteristics that in turn might obscure the picture of possible models. 

Finally, forecasting sometimes is the main purpose of a time series analysis. In 
this case parameter estimation becomes an intermediate step from which the 
forecasts can be obtained. Suppose now that the series under study follows the 
contaminated structure (2.2). In this situation, should one construct optimal 
estimates based on the constraint of bounded gross error sensitivity or should 
one treat (2.2) directly as the underlying generating mechanism of the process 
and derive forecasts from it? Perhaps one should also consider the purpose of 
time series analysis in defining influence functionals and gross error sensitivity. 

In summary, the paper marks important progress on robustness in time series 
analysis and I congratulate the authors on their fine work. 


REFERENCE 
FOX, A. J. (1972). Outliers in time series. J. Roy. Statist. Soc. Ser. B 34 350-363. 


DEPARTMENT OF STATISTICS 
CARNEGIE-MELLON UNIVERSITY 
SCHENLEY PARK 

PITTSBURGH, PENNSYLVANIA 15213 


EDWARD J. WEGMAN 
George Mason University 


The problem of robust inference in time series is a long-standing one to which 
Professors Martin and Yohai have made signal contributions. This present paper 
continues the series of excellent contributions and I am pleased to have the 
opportunity to comment. 

This paper lays out fundamental definitions of influence functionals and gross 
error sensitivity as a generalization of the corresponding concepts in the 
traditional i.i.d. case. These are illustrated with computations for several robust 
estimators of parameters in simple first order autoregressive and moving average 
models. These are basically toy examples, although the nonboundedness result 
for the MA(1) model is indeed intriguing. There is, however, a rich mine of 
further situations to explore, many of which may be formulated as the general 
replacement model. 

Low order autoregressive schemes are useful in a feedback tracking context. 
That is to say, if $x_i$ is a position-velocity vector subject to linear control by some 
guidance system, then a useful model for $x_i$ may be an autoregressive model. 
Traditional approaches to such a problem often involve Kalman filtering. Clearly, 
however, position-velocity sensors may be subject to gross errors, for example, 
sun glint in an infrared (IR) sensor. Clearly noise in such a system (as opposed to 
innovations) could be modeled as a mixture distribution. Supposing sun glint did 
affect an IR sensor, it is likely to persist for some time, highlighting the 
Martin-Yohai concern with patchiness in the noise structure. The point is that a 
mildly realistic problem readily suggests many complicated models of general 
interest—robust estimation of parameters in a general nonlinear process model 
or robust estimation of the parameters of the Kalman filter, to mention just two. 

A related, highly useful time series problem, perhaps the oldest time series 
problem, is the estimation of the sum of (essentially) deterministic sinusoids in 
white (often Gaussian) noise. Rotating machinery generates such sinusoids and 
the application is many-fold, including the obvious naval one. Often the ambient 
Gaussian noise is contaminated with impulsive noise, which can be either 
isolated (for example with biological or offshore drilling noises in the ocean 
acoustic setting, or thunderstorm noises in the electromagnetic setting) or patchy 
(say in the case of cracking—grinding ice in the arctic acoustic setting). Impulsive 
noise tends to be heavy tailed compared to the ambient noise process and is 
frequently modeled with double exponential marginals or even heavier tailed 
models. The book by Wegman and Smith (1984) contains several papers discuss- 
ing realistic mixture distribution noise models. In most cases the Martin-Yohai 
AO model would be highly appropriate. An interesting special case is the 
reverberation limited case. In such a case, to use the Martin-Yohai notation for 
the AO model, 

$w_i = x_i + \sum_j a_j x_{i-j} + v_i,$ 

with $a_j$ unknown parameters. In most realistic cases, the fraction of contamination $\gamma$ 
would be close to 1 since the reverberation $\sum_j a_j x_{i-j}$ is likely to be present 
whenever the signal, $x_i$, is present. Clearly the Martin-Yohai formulation of the 
IF and GES offers an exceedingly rich context for formulating many interesting 
time series estimation problems. 

Another interesting problem whose robust formulation is unclear to me is the 
intervention problem. Suppose, for example, that $x_i$ is a process whose 
fundamental probability structure changes at some unknown time, $t_0$. How can we 
estimate $t_0$? 

On the surface, it would appear that a robust technique would tend to 
attribute deviations of $y_i^\gamma$, $i > t_0$, to outliers and hence suppress them. 
Admittedly I have thought only superficially about this. It is clear, however, that a 
rather more complex formulation is needed. It may be the case, for example, that 
$x_i$, $w_i$ are independent if $i < t_0$ but dependent for $i > t_0$. Similarly the signal 
detection problem could benefit from a robust treatment, but its robust 
formulation is unclear. Indeed, in both cases the statistic involved must distinguish 
rather subtle differences in underlying models which, in fact, may be masked by 
the robustified statistic. Still such problems in realistic circumstances often have 
data which are contaminated by outliers. 

The point of this comment is not to critique the authors for what they have 
not done. In all truly innovative work there is obviously much left to be done. 
The point is only to mention a few situations which could profit from further 
exploration. The formulation of IF and GES in the time series context is an 
important step forward. I cannot resist observing that, in view of the inability to 
use Gateaux derivatives, it was clearly no piece of cake. 


REFERENCE 


WEGMAN, E. J. and SMITH, J. (eds.) (1984). Statistical Signal Processing. Marcel Dekker, New 
York. 


CENTER FOR COMPUTATIONAL STATISTICS 
AND PROBABILITY 

GEORGE MASON UNIVERSITY 

4400 UNIVERSITY DRIVE 

FAIRFAX, VIRGINIA 22030 




MIKE WEST 
University of Warwick 


The authors are to be congratulated on a succinct mathematical development 
of time series influence functionals that generalises and usefully extends existing 
concepts and techniques of classical robustness. The definition directly caters to 
the types of outliers specific to time series and reflects the authors’ experience 
with practical modelling and analysis using robust techniques. Central to the 
paper is the general replacement model for linear (gaussian) series which, to my 
mind, is the most interesting contribution. There are many ways of representing 
the various outlier types associated with time series, all closely related to this 
general model. My own preference is for the simple, state space type of model, 
which allows for the various outliers hierarchically. Such a model is 


     $Y_t = Z_t + \nu_t,$ 
(1)  $Z_t = X_t + \delta_t,$ 
     $X_t = G_t(\mathbf{X}_{t-1}) + \omega_t,$ 


where $Y_t$ is the observed series; $\nu_t$ is a zero mean observational noise process, 
typically comprising independent errors; $X_t$ is the core process defined as a 
function of $\mathbf{X}_{t-1} = \{X_{t-1}, X_{t-2}, \ldots\}$ and the process evolution noise $\omega_t$; and $\delta_t$ 
represents a superimposed contamination process. Additive, or purely 
observational, outliers are modelled by large $\nu_t$ and changes in the core process $X_t$ are 
modelled by large $\omega_t$. The $\delta_t$ process introduces patchy outliers that may be 
viewed as purely stochastic or related to independent variables via regression or 
transfer function effects. 

Models such as (1), and more complex versions of them, have been used 
extensively by Bayesian forecasters in applications where protection against 
additive outliers and adaptation to changes in the $Z_t$ process, via $\delta_t$ or $\omega_t$, are of 
importance. The authors are, of course, familiar with this approach but, in their 
current paper, part company with Bayesian forecasters in important ways. In 
my opinion the techniques proposed are most appropriate for fast processing of 
long series of observations with short sampling intervals when a sustained, 
stationary core process is evident. Such applications may arise commonly in the 
engineering fields with which the authors are familiar. In such areas, where 
interest centres on the estimation of the stable core process and large amounts of 
data are available, the dominating data smoothing feature of robust techniques 
is very relevant. Otherwise, the objectives of the time series modelling activity 
must be more carefully considered. Models such as (1) are directly geared to the 
specific operational requirements of sequential forecasting that is a primary goal 
for Bayesian modellers. Here the ability to detect and distinguish the various 
outlier types and adapt forecasts appropriately is paramount. Omnibus robust 
methods would tend to oversmooth and hence obscure the local behavior that is 
so relevant in short-term forecasting. Simple sequential techniques for outlier 
detection, intervention and adaption to change are described in West (1986) and 
applied in dynamic Bayesian forecasting by West and Harrison (1986). The 
occurrence of patchy outliers, and possible explanation using independent 
variables hitherto omitted from the model, is of great interest in improving forecasts, 
with the emphasis on identifying and estimating the $\delta_t$ process and its 
development into the future. 

A related point of contention is the assumption of a stable core process 
defined by constant parameters whose estimation is the primary goal. A key 
underlying principle in dynamic Bayesian modelling is the rejection of stationar- 
ity in general and the associated allowance for parametric changes over time. 
Unlike the above mentioned engineering application areas, business and 
economic series, typically rather short in length, exhibit only local stability, global 
nonstationarity, with both sustained, steady, small changes and more marked, 
abrupt changes in defining parameter values. Thus the primary goal of the 
authors’ robust estimation techniques would appear to be limited in scope for 
application. Can it be adapted to allow for dynamic parameters changing over 
time? This would be particularly important if independent variables were to be 
incorporated. Change over time of regression coefficients is not only expected as a 
general, steady dynamic, but also vital in allowing for the unpredictable effects 
of further related variables not recognised as being of importance. 

The authors may be interested in considering extensions of their techniques, 
and their outlier models, to nonstandard problems such as those arising with 
non-gaussian processes. Outlier models apart, there are important questions 
raised as soon as the non-gaussian nature of time series is admitted. Suppose for 
example, that the series is discrete, or simply binary. Binary series arise both 
naturally and by construction in many areas. Particular examples, quite common 
in applications where data rates are high, and data reduction necessary, concern 
series derived from underlying, continuous processes via clipping operations. 
Specifically, if $Y_t$ is such a basic process, a binary series $S_t$ is derived by clipping 
$Y_t$ at level $A$ if $S_t$ has the representation 

$S_t = \begin{cases} 1, & \text{if } Y_t \ge A, \\ 0, & \text{if } Y_t < A. \end{cases}$ 

Clearly the theoretical characteristics of the $S_t$ series may be derived from any 
suitable continuous time series model for $Y_t$. In the outlier modelling framework, 
models such as (1) should produce interesting contaminated binary series. 
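The clipping operation and its interaction with additive outliers can be sketched in a few lines of Python. The helper names and the toy data are hypothetical and serve only to illustrate the representation above.

```python
def clip_series(y, level):
    """Binary series S_t = 1 if Y_t >= level, else 0."""
    return [1 if value >= level else 0 for value in y]


def flipped_positions(core, contaminated, level):
    """Indices at which contamination changes the clipped value."""
    s_core = clip_series(core, level)
    s_cont = clip_series(contaminated, level)
    return [t for t, (a, b) in enumerate(zip(s_core, s_cont)) if a != b]


if __name__ == "__main__":
    core = [0.2, -1.0, 0.0, -0.5]
    # Additive outliers of size 5 at t = 0 and t = 1.
    contaminated = [5.2, 4.0, 0.0, -0.5]
    print(clip_series(core, 0.0))
    print(flipped_positions(core, contaminated, 0.0))
```

An outlier changes the binary series only if it pushes the observation across the clipping level, so contamination of $Y_t$ appears in $S_t$ as occasional flipped values rather than as extreme ones.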

My own approach to practical modelling for non-gaussian series has, however, 
been somewhat different, being based on the development of dependence models 
for the $S_t$ series directly. In the binary case, the family of dynamic generalised 
linear (and nonlinear) models introduced in West, Harrison and Migon (1985) 
provides a rich class of process structures currently under study. As a simple 
example, a first order autoregressive type of model for $S_t$ that parallels the 
standard linear, gaussian state space model, is given by taking 

$P(S_t = 1 \mid \pi_t) = \pi_t \qquad (0 < \pi_t < 1),$ 

where, setting $Z_t = \log(\pi_t/(1 - \pi_t))$, then 

(2)  $Z_t = \phi Z_{t-1} + \omega_t$ 

for some (gaussian?) noise sequence $\omega_t$. Thus the unobservable AR process $Z_t$ 
provides the “success” probability for $S_t$ after transformation. Generalisations to 
higher order processes, transforms other than the log-odds, and time-varying 
parameters ($\phi_t$ rather than simply $\phi$), are evident. The outlier model (1) can now 
be extended to this binary series by a minor extension of (2) to 

$Z_t = X_t + \delta_t,$ 

$X_t = \phi X_{t-1} + \omega_t,$ 


incorporating changes via the $\omega_t$ series, patchy outliers, and, now, observational 
outliers through appropriate models for the $\delta_t$ series. The only point of 
significant difference between this model and (1) is that the sampling model is now 
Bernoulli, rather than gaussian, which leads to a slightly different view of the 
way in which observational outliers are generated. A closely related, but 
structurally quite different, class of models for binary series provides for dynamic 
evolution of transition probabilities in Markov chains. The first order case, for 
example, has a basic model for $P(S_t = 1 \mid \pi_t)$ as above, but, instead of the 
continuous process model for the log-odds probability $Z_t$ in (2), a discrete version 

(3)  $Z_t = \theta_t + \phi_t S_{t-1} + \omega_t,$ 

where $\theta_t$ and $\phi_t$ are time-varying process parameters and $\omega_t$ is, as usual, process 
evolution noise. Concerning outlier models, a basic problem arises with (3) in 
that the observations S, are fed back into the process model, so a little more 
thought is required in modelling pure observational outliers. Perhaps the authors 
have some comments on such problems. 
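The binary autoregressive model (2), with the contamination extension $Z_t = X_t + \delta_t$, can be simulated as follows. This Python sketch makes several assumptions not fixed by the text: gaussian $\omega_t$, a constant $\phi$, a start at $X_0 = 0$, and a user-supplied $\delta_t$ sequence; the function names are hypothetical.

```python
import math
import random


def logistic(z):
    """Inverse of the log-odds transform: pi = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_binary_ar(n, phi, sd_omega, delta=None, seed=0):
    """Simulate S_t with P(S_t = 1 | pi_t) = pi_t and log-odds Z_t = X_t + delta_t,
    where X_t = phi * X_{t-1} + omega_t and omega_t ~ N(0, sd_omega^2) (assumed).

    Returns (S, pi): the binary series and the success probabilities.
    """
    rng = random.Random(seed)
    if delta is None:
        delta = [0.0] * n      # no contamination: reduces to model (2)
    x = 0.0
    s, pi = [], []
    for t in range(n):
        x = phi * x + rng.gauss(0.0, sd_omega)
        p = logistic(x + delta[t])
        pi.append(p)
        s.append(1 if rng.random() < p else 0)
    return s, pi


if __name__ == "__main__":
    # A patch of contamination on the log-odds scale at t = 20..24.
    delta = [4.0 if 20 <= t < 25 else 0.0 for t in range(50)]
    s, pi = simulate_binary_ar(50, phi=0.8, sd_omega=0.5, delta=delta)
    print(s)
```

Here contamination acts on the log-odds, so it shifts the success probability rather than the observation itself, which is one way to read the remark that the Bernoulli sampling model changes how observational outliers are generated.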


REFERENCES 


WEST, M. (1986). Bayesian model monitoring. J. Roy. Statist. Soc. Ser. B 48 70-78. 

WEST, M. and HARRISON, P. J. (1986). Monitoring and adaptation in Bayesian forecasting models. 
To appear in J. Amer. Statist. Assoc. 

WEST, M., HARRISON, P. J. and MIGON, H. S. (1985). Dynamic generalized linear models and 
Bayesian forecasting (with discussion). J. Amer. Statist. Assoc. 80 73-97. 


DEPARTMENT OF STATISTICS 
UNIVERSITY OF WARWICK 
COVENTRY CV4 7AL 
ENGLAND 


REJOINDER 


R. DOUGLAS MARTIN AND VICTOR J. YOHAI 


University of Washington and University of Buenos Aires 
and CEMA, Buenos Aires 


The discussants have provided us with more than ample food for thought 
concerning a myriad of issues related to our work on influence functionals for 
time series. Leading issues include the following: (1) Relationships and 
differences between ICH and IF. Aspects of this include (a) the model-dependent 
feature of IF, and (b) the desirability or lack thereof of averaging inherent in the 
definition of IF; (2) data-oriented measures of influence; (3) generalizations 
and/or modifications of the IF to cover prediction, spectral estimates, non- 
stationarity, testing, etc.; (4) lack of consistency and bias control; (5) robust 
model selection; and (6) innovations outlier models, intervention, adaptivity, etc. 


Relationships and differences between ICH and IF. Künsch has pro- 
vided a transparent heuristic formula which displays the relationship between 
ICH and IF for time series, and allows one to see clearly why a difficulty can 
arise when T does not depend only on a finite dimensional marginal measure: 
namely, boundedness of $\psi$ does not necessarily yield boundedness of IF. This is 
an enlightening viewpoint, which gets at a problem special to the time-series 
setting in short order. Also, Künsch’s suggestion to split Theorem 4.2 up in the 
manner of his Theorem 4.1’ is a useful and welcome contribution. It is indeed 
desirable to state a main result in a form which is as free as possible from 
model-specific assumptions, and therefore facilitates a wider range of applica- 
tions of the IF. 

Künsch’s main concerns with our approach are that IF involves (too much) 
averaging (of ICH), and that the IF and GES are determined by contamination 
models which are too specific and narrow. 

The basis of Künsch’s complaint about model specificity is evidently his view 
that one seldom knows the type of contamination in advance. However, it is our 
experience that on the contrary, for many time series arising in practice the 
investigator does indeed have a pretty good idea of whether patchy or isolated 
outliers are to be encountered (and perhaps a little detail is available concerning 
qualitative patch shape, but not too much else). Indeed, Wegman has indicated 
the viability of the forms already included in (2.2) for a variety of real-world 
applications, where either patchy or isolated outliers are expected. In the case of 
radar glint noise, for example (Figure 14 of Martin and Thomson, 1982), the 
outliers consist of spikes having a moderately consistent shape, with random 
amplitude and separations which are approximately independent exponential 
random variables. In target tracking contexts, the amplitudes will be negligible 
and hence there will be no outliers at far range situations, whereas the ampli- 
tudes will be large and the outliers will be quite potent at close ranges. Aside 
from the varying amplitudes and separation times, the structure of the outlier 
model is relatively constant. 

Given that one can in many cases be relatively confident of outlier type, the 
model dependent aspect of IF is highly desirable. In such cases, IF and GES 
calculations, and optimality results over “narrow” classes can help indicate what 
type of estimate is preferred with regard to its infinitesimal bias control. Further- 
more, in case different specific types of outliers are considered likely to appear, 
one may use the corresponding GES criterion to select the estimate instead of the 
IF. 

Of course when one is indeed relatively ignorant of the type of outliers to be 
encountered, then optimality over narrow classes is indeed of little use, and 
Künsch’s suggestions concerning the choice of P in Definition 6.1 may be 
appropriate. Furthermore, under complete ignorance one might be happy with 
the conservative/pessimistic approach of optimally bounding GESH = sup|ICH| 
in the time series case as Künsch (1984) has done. 

With regard to the pure replacement (PR) versus additive outliers (AO) issue, 
Künsch’s PR calculations show that the Huber function can be preferred over 
the bisquare function (which is opposite to the result of our AO calculation). 
However, it is our opinion that the PR model is seldom appropriate in practice. 
We are not thereby suggesting that AO is always appropriate. It is just that the 
value of a time series at an outlier position will usually contain some shadow or 
vestige of the core process, and in these cases AO will often be a better 
approximation than PR. Furthermore, AO will often be quite a good approxima- 
tion—this is true, for example, in the glint noise example cited above, and it is 
certainly true in those situations for which intervention analysis (Box and Tiao, 
1975) is appropriate. Also, it appears feasible to construct statistics which 
discriminate between PR and AO. 

In any event, we feel that the understanding gained by the calculation of IF’s 
for different estimators at different contamination models yields insight concern- 
ing the interplay between different estimators and different contamination mod- 
els which is useful in its own right. Many more such calculations remain to be 
carried out. Such calculations can help resolve the kind of question raised by 
Franke and Hannan: How much does IF depend on different contamination 
models having qualitatively similar sample paths? We pursue this question with 
respect to their particular model (1). 

Although model (1) does not quite fit into the general model (2.2), it does fit 
into the following slight generalization: 

$$y_t^\gamma = (1 - z_t^\gamma)\,x_t + z_t^\gamma w_t^\gamma, \tag{2.2'}$$

where now the contaminating process $w_t^\gamma$ has a distribution depending on $\gamma$. In 
order to get model (1) we can take $z_t^\gamma$ as in (2.4) with $\gamma = kp$ and $\tilde{z}_t^p = g(\varepsilon_t)$, 
where $g(t) = 1$ for $t \neq 0$ and $g(0) = 0$, and then set $w_t^\gamma = x_t + \sum_{j=0}^{k-1} \beta_j \varepsilon_{t-j}$. Since 
the distribution of the $\varepsilon_t$’s depends on $p$ and therefore on $\gamma$, we need (2.2’) 
instead of (2.2). 

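The definition (4.5) of IF can be extended without modifications to the more 
general model (2.2’). In the case of model (1) with $x_t$ an AR(1) process we get for 
GM estimates 

$$\mathrm{IF}(\{\mu_w^\gamma\}, T_{GM}, \{\mu_y^\gamma\}) = \sum_{j=1}^{k-1} E\,\eta\bigl(u_1 + (\beta_j - \phi\beta_{j-1})\varepsilon_0,\; h(\phi)(x_0 + \beta_{j-1}\varepsilon_0)\bigr) + E\,\eta\bigl(u_1 - \phi\beta_{k-1}\varepsilon_0,\; h(\phi)(x_0 + \beta_{k-1}\varepsilon_0)\bigr),$$

where $h(\phi) = (1 - \phi^2)^{1/2}$. In the case of $\beta_0 = \cdots = \beta_{k-1} = 1$, we get 

$$\mathrm{IF}(\{\mu_w^\gamma\}, T_{GM}, \{\mu_y^\gamma\}) = (k - 1)\,E\,\eta\bigl(u_1 + (1 - \phi)\varepsilon_0,\; h(\phi)(x_0 + \varepsilon_0)\bigr) + E\,\eta\bigl(u_1 - \phi\varepsilon_0,\; h(\phi)(x_0 + \varepsilon_0)\bigr),$$
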
which is very close to our formula (5.8) for patchy additive outliers. This is 
reassuring, since, as observed by Franke and Hannan, the qualitative behavior 
of model (1) will be very similar to our case of patchy additive outliers, provided 
that all the £, have the same sign. 

The issue of whether or not IF involves too much averaging is a most basic 
one. It is primarily when the number of outliers is small, which corresponds on 
average to $\gamma n$ small, that the averaging-based IF is unlikely to provide a good 
indication of the influence of outliers on an estimate in finite samples. Such 
situations are indeed troublesome, for neither sup|ICH| nor IF is likely to give a 
uniformly accurate assessment across different configurations of outliers. On the 
other hand, such situations are precisely where totally data-oriented measures of 
influence for time series, such as that suggested by Brillinger, may come into 
their own. Of course when the number of outliers is moderate to large, one must 
use an appropriate averaging in order that IF adequately reflect the influence of 
the outliers. Analysis of the data at hand, aided by any reasonably robust 
method, will often provide useful guidance here. 


Data-oriented measures of influence. Brillinger’s suggestions concerning 
“leave-one-out” diagnostics/influence measures for time series are highly appropriate, 
both for their own sake and for the complementary nature the 
techniques have relative to ICH and IF. The “leave-one-out” approach is 
tailor-made for Künsch’s “single outlier in a series of length n.” This natural 
data-oriented approach for time series has been neglected for so long, in spite of 
Brillinger’s (1966) proposal, only because the method is fairly computing inten- 
sive, and (as Brillinger points out) good algorithms for handling missing data 
problems in time series have been developed only relatively recently (see 
Brillinger’s references). There are, however, some issues concerning leave-one-out 
diagnostics for time series which should be mentioned. 

First of all, there is a clear smearing effect associated with the influence of 
isolated outliers in the leave-one-out approach. This effect is evident in 
Brillinger’s Figure 3: Adjacent to each “large” peak there are one or two values 
of roughly half the local amplitude of the peak. The reason for this behavior is 
inherent in the Gaussian maximum-likelihood leave-one-out technique, and it can 
conceivably give a false indication of a single dominant outlier when in fact there 
are two outliers separated by a single good point. 

Another, perhaps more serious difficulty, is that leave-one-out diagnostics can 
fail to give an indication of problems when the outlier is of a “k-in-a-row” patch 
form. This is a special form of what has been called the “masking” problem in the 
regression diagnostics literature. In the regression setting the masking problem 
has been relatively ignored due to the computational burden required to check 
for masking in unstructured problems, namely order $\binom{n}{k}$ when k well-masked 
outliers are present. However, in the structured time-series setting we can easily 
detect masking due to a single patch of length k in order n 
by computation of “leave-k-out” diagnostics whereby $y_{i+1}, y_{i+2}, \ldots, y_{i+k}$ are 
deleted and a Gaussian missing-data MLE is used to fit the model for 
$i = 0, 1, \ldots, n - k$. 
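
The leave-k-out scan can be sketched in a few lines. The code below is a minimal illustration under simplifying assumptions, not the procedure used for the results reported here: in place of a Gaussian missing-data MLE it uses a least-squares AR(1) fit restricted to adjacent pairs in which both values are present, and the function names, the simulated series, and the patch location are hypothetical.

```python
import random


def ar1_fit(y, missing=frozenset()):
    """Least-squares AR(1) coefficient using only adjacent pairs in which
    both values are present (a simple stand-in for a missing-data MLE)."""
    num = den = 0.0
    for t in range(1, len(y)):
        if t in missing or (t - 1) in missing:
            continue
        num += y[t] * y[t - 1]
        den += y[t - 1] ** 2
    return num / den


def leave_k_out(y, k):
    """Change in the AR(1) estimate when y[i], ..., y[i+k-1] are deleted,
    for every start i = 0, ..., n - k (0-based indexing)."""
    full = ar1_fit(y)
    n = len(y)
    return [ar1_fit(y, frozenset(range(i, i + k))) - full
            for i in range(n - k + 1)]


rng = random.Random(1)                 # fixed seed, illustration only
x = [rng.gauss(0.0, 1.0)]
for _ in range(99):
    x.append(0.5 * x[-1] + rng.gauss(0.0, 1.0))
y = list(x)
for t in (40, 41, 42):                 # hypothetical patch of three additive outliers
    y[t] += 20.0

d1 = leave_k_out(y, 1)                 # leave-one-out profile
d3 = leave_k_out(y, 3)                 # leave-3-out profile; d3[40] deletes the whole patch
```

Only n - k + 1 refits are needed for a given k, which is what makes the structured time-series case so much cheaper than the unstructured regression case. Deleting the whole patch reproduces the fit that the uncontaminated series would give with the same points missing; windows that overlap the patch only partially can still be strongly affected, so the whole profile is worth inspecting rather than its maximum alone.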


[Figure 1 (plot not reproduced). Panel title: Robust Smoother, Hampel Two-Part Redescending Psi, a = 1.75, b = 3.00. Horizontal axis: Year, 1820 to 1940.]

FIG. 1. Residuals $y_t - \hat{x}_t$ for Canadian lynx data, where $\hat{x}_t$ is smoother-cleaned value based on 
robust ARMA(3, 3) fit. 


Of course one cannot completely solve the masking problem with any degree of 
computational ease. The existence of more than one patch, or more than one 
isolated outlier, or both, can result in masking which can only be completely 
dealt with by an order of computational complexity which approaches that of 
unstructured regression problems. It is possible that some data sets may contain 
too many configurations of outliers to effectively cope with. Fortunately, experi- 
ence indicates that: (i) complete masking does not occur with great frequency, 
even with multiple patches and isolated outliers, and (ii) iterative interpolation of 
the most influential data points will often reveal other influential points which 
are initially masked. 

Details concerning some of the various claims made above are provided in 
Bruce and Martin (1986). 

In answer to Brillinger’s question to us: There exist good robust techniques for 
fitting ARIMA models to data with many outliers, based on robust filter-cleaners 
or smoother-cleaners (see for example Martin, 1981, or Martin, Samarov and 
Vandaele, 1983), and these techniques along with their robust residuals diagnos- 
tics should always be used (along with other methodologies, including leave-k-out 
diagnostics) when one is not absolutely sure that the data is outlier free. These 
residuals diagnostics generally give a much clearer view of outlier structure than 
leave-k-out diagnostics, and take considerably less computational time (our 
current version of “leave-k-out” is still too slow to be really pleasant on a PC). 
Figure 1 shows the residuals $y_t - \hat{x}_t$ for a robust fit of an ARMA(3, 3) model to 
the Canadian lynx data. Since $\hat{x}_t = y_t$ for “good” data points, most of the 
residuals are zero. The nonzero residuals indicate nearly the same “suspect” data 
points as those revealed by Brillinger’s plot. 

In summary: while “leave-k-out” diagnostics appear to have a useful role in 
time series analysis, procedures based on robust filter- or smoother-cleaners 
would be preferred if just one of the two techniques were to be used. 


Generalizations of IF. The issues of generalizations and/or applications of 
the IF to problems such as prediction, testing, spectral estimation and long-memory 
processes have been touched upon by Franke and Hannan, Robinson, Tsay, 
and Poor. Although Sections 8.1 and 8.2 make small contributions to IF’s for 
spectral estimates and tests, there remain a number of questions to be pursued. 
At the moment we are able to respond to some of the specific questions raised by 
these discussants. 

Consider the case of an ARMA-type spectral estimate. Let $S(T(\mu_y^\gamma)) = S(f; T(\mu_y^\gamma))$ 
denote the asymptotic functional representation of an ARMA-type 
spectral estimate, where 

$$T(\mu_y^\gamma) = \bigl(\phi_1(\mu_y^\gamma), \ldots, \phi_p(\mu_y^\gamma), \theta_1(\mu_y^\gamma), \ldots, \theta_q(\mu_y^\gamma), \sigma^2(\mu_y^\gamma)\bigr)$$

with $T(\mu_x) = \alpha$ the parameters of an $x_t$ process ARMA model and $S(\alpha)$ the 
corresponding spectral density. Then it is straightforward to calculate a pointwise 
influence functional $\mathrm{IF}_S(\mu_w, f, T)$ for S. Use of the chain rule gives: 

$$\mathrm{IF}_S(\mu_w, f, T) = \frac{d}{d\gamma} S\bigl(f; T(\mu_y^\gamma)\bigr)\Big|_{\gamma=0} = \left[\frac{dS}{dT}\right]_{T=\alpha} \frac{dT(\mu_y^\gamma)}{d\gamma}\Big|_{\gamma=0},$$

and the first factor of the right-hand side does not depend on T (but does depend 
upon frequency f). Therefore in order to compare the influence curve of two 
estimators $S(T_1)$ and $S(T_2)$ of $S(\alpha)$, we only need to compare the influence curves 
of $T_1$ and $T_2$. Hence, in answer to Franke and Hannan, robustness of ARMA 
model parameter estimates determined by IF properties is inherited by an 
ARMA-type spectral estimate, albeit with a frequency-dependent weighting 
factor. Essentially the same is true if one computes $\mathrm{IF}_{\log S}$ for log S. 
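
The frequency-dependent weighting factor can be written down explicitly in the simplest case. The sketch below assumes an AR(1) model with the innovations variance held fixed, so that T reduces to the single coefficient phi; the function names and the two influence values used in the usage note are hypothetical, chosen only to show that the ratio of the spectral influences of two estimators does not depend on frequency.

```python
import math


def ar1_spectrum(f, phi, sigma2=1.0):
    """Spectral density of an AR(1) process at frequency f (cycles per unit time)."""
    return sigma2 / (1.0 - 2.0 * phi * math.cos(2.0 * math.pi * f) + phi * phi)


def dspectrum_dphi(f, phi, sigma2=1.0):
    """First factor of the chain rule, dS/dphi: depends on f but not on the estimator."""
    c = math.cos(2.0 * math.pi * f)
    return sigma2 * (2.0 * c - 2.0 * phi) / (1.0 - 2.0 * phi * c + phi * phi) ** 2


def spectral_if(f, phi, if_phi, sigma2=1.0):
    """Pointwise influence of the spectral estimate: (dS/dphi) times the IF of phi-hat."""
    return dspectrum_dphi(f, phi, sigma2) * if_phi
```

For example, with hypothetical influence values of -0.8 for one estimator and -0.2 for another, `spectral_if(f, 0.5, -0.8) / spectral_if(f, 0.5, -0.2)` is the same at every frequency f at which the weight is nonzero, which is the sense in which comparing the two spectral estimates reduces to comparing the two parameter estimates.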

Of course, S(T) may not be a good estimate of the functional S in the case 
where x, does not conform to a parametric ARMA model, but the main issue in 
such cases is parametric approximation rather than robustness of T. 

Unfortunately, nonparametric robust spectral density estimates, such as those 
involving robust prewhitening described in Kleiner, Martin, and Thomson (1979), 
are much more complicated, and simulation will be required to determine an $\mathrm{IF}_S$ 
or $\mathrm{IF}_{\log S}$ in such cases. 

Franke and Hannan note that the computed $\mathrm{IF}_S(f, f)$ in Section 8.2 does not 
depend on frequency. The reason for this is that the contamination is white noise, 
along with the averaging involved in the IF (see the earlier comments by both 
Künsch and ourselves). To get a measure of influence which will show the 
frequency-dependent effects associated with specific outlier patterns, one can 
either take a data-oriented approach as suggested by Brillinger and discussed 
above, or pursue the IF approach with an appropriate contamination model. In 
the data-oriented approach one could use the leave-k-out technique to fit a good 
AR or ARMA model and interpolate at the deleted points—the difference 
between spectrum estimates based on the original data and those based on 
“leave-k-out and interpolate” modifications of the data will show frequency- 
dependent influence (however, this data-oriented approach may often involve an 
embarrassing amount of computation). In the IF approach, $\mu_w$ might be selected 
so as to generate outliers resembling those seen in the data, e.g., in the waveguide 
of Kleiner, Martin, and Thomson (1979), the outliers might come in pairs having 
a fixed separation, but with random separation from pair to pair; the resulting 
$\mathrm{IF}_S(f, f)$ will depend upon f. (Incidentally, such possibilities suggest how IF 
may give a useful indication of how an estimate reacts to a specific configuration 
of outliers in the data at hand.) 

Franke and Hannan have also raised the question of how stable the IF is 
when the nominal model is known only approximately. Here stability is equivalent 
to asking for some kind of continuity of IF with respect to $\mu_x$. If we use the 
weak topology, continuity of IF is closely related to robustness (Hampel, 1971; 
Boente, Fraiman, and Yohai, 1982). According to (4.2) and (4.6) 

$$\mathrm{IF}(\mu_w, T, \{\mu_y^\gamma\}) = -C^{-1} \lim_{\gamma \to 0} \frac{E\,\Psi(\mathbf{y}_1^\gamma, t_0)}{\gamma},$$

with C given by (4.2’). If $\Psi$ is bounded and depends only on a finite number k of 
coordinates, then the second factor on the right-hand side will depend continuously 
(with respect to the weak topology) on the corresponding k-marginal 
distribution of the nominal process $x_t$. Similar results occur when $\Psi$ depends on 
an infinite number of coordinates, but this dependence decreases quickly enough, 
e.g., as with GM and RA estimates. However, typically the behavior of C will be 
different. For example, in the case of GM and RA estimates, C depends on the 
first moment of $x_t$ when the $\eta$ function is of the Mallows type and on the second 
moment of $x_t$ when it is of the Hampel type. Thus in order to have continuity of 
IF with respect to $\mu_x$ at a nominal model $\mu_{x,0}$ one may have to use a metric 
which implies closeness of the moments for $\mu_x$ and $\mu_{x,0}$. 

Künsch, Poor and Tsay all raise more or less directly a quite important 
question: How does one deal with nonstationarity in the central model $\mu_x$, in 
deviations from the central model, or in both? 

In order to be as general as possible, one might proceed as follows. Let $\{T_n\}$ be 
a sequence of univariate estimators indexed by sample size n, and let $T_n^\gamma$ denote 
the value of $T_n$ for the contaminated process $y_t^\gamma$. The arc $\{\mu_y^\gamma\}$ may now be 
nonstationary by virtue of one or more of the measures $\mu_x$, $\mu_w$, $\{\mu_z^\gamma\}$ being 
nonstationary. Then define the (absolute-value) influence functional for nonstationary 
processes as 

$$\mathrm{IF}_a(\mu_w, \{T_n\}, \{\mu_y^\gamma\}) = \lim_{\gamma \to 0} \frac{1}{\gamma}\, E \limsup_{n \to \infty} |T_n^\gamma - T_n^0|.$$

Of course for many cases of interest, including the nonstationarity examples 
presented by Künsch and Poor, one will have $T_n^\gamma \to T(\mu_y^\gamma)$ and $T_n^0 \to T(\mu_y^0)$ 
almost surely. In such cases we have $\limsup_{n \to \infty} |T_n^\gamma - T_n^0| = |T(\mu_y^\gamma) - T(\mu_y^0)|$, 
the expectation is superfluous, and $\mathrm{IF}_a = |\mathrm{IF}|$. Correspondingly, given a family 
P of nonstationary arcs $\{\mu_y^\gamma\}$ we define the gross-error sensitivity: 

$$\mathrm{GES}(P, \{T_n\}) = \sup_{\{\mu_y^\gamma\} \in P} \mathrm{IF}_a(\mu_w, \{T_n\}, \{\mu_y^\gamma\}).$$


Similar definitions may be given for the multivariate case. 

For some types of contamination, nonstationary processes $\{w_t\}$ are no more 
harmful than stationary $\{w_t\}$ in terms of GES. For example, consider the AR(1) 
model with independent and additive outliers, i.e., $w_t = x_t + v_t$, and suppose that 
$\eta$ satisfies (A1)-(A4), along with 

(A5) $\eta(u, v)$ is monotone in each variable and $uv \geq 0$ implies $\eta(u, v) \geq 0$. 

Then we can prove that the GES for GM and RA estimates when $\{v_t\}$ is allowed 
to be nonstationary is the same as when it is restricted to being stationary, and 
is given by Theorem 5.1 of Martin and Yohai (1984b). Further study is needed to 
determine the extent to which similar results may be true for other models and 
different types of contamination. 

Concerning robust tests, Robinson’s suggestion regarding robustified score 
tests is a good one which has been recently pursued (Basawa, Huggins, and 
Staudte, 1985). In fact, the general area of formal inference for robust procedures 
is in need of more attention not only in the time-series setting, but also in the 
more classical contexts such as linear regression, etc. However, the construction of 
useful finite-sample tests and confidence intervals has proved difficult enough in 
the non-time-series setting, and the problem can hardly be any easier for the 
time-series setting. 

Franke and Hannan, and Tsay are quite correct in pointing out that one 
should consider the purpose of the analysis when defining influence functionals. 
Thus if one is concerned about prediction, then one should use an appropriate 
influence functional $\mathrm{IF}_p$ for prediction. 

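If one is willing to focus on prediction based on the “good” data $x_t$, then the 
following definition would be suitable. Consider the autoregression context: Let 
$\phi(\mu_y^\gamma)$ denote the functional representation of the parameter estimates, and let 
$\mathbf{x}_t^T = (x_{t-1}, \ldots, x_{t-p})$. Suppose we use the linear predictor $\hat{x}_t(\mu_y^\gamma) = \mathbf{x}_t^T \phi(\mu_y^\gamma)$. 
Then 

$$\mathrm{IF}_p = \frac{d}{d\gamma}\, \hat{x}_t(\mu_y^\gamma)\Big|_{\gamma=0} = \mathbf{x}_t^T\, \mathrm{IF},$$

where $\mathrm{IF} = \mathrm{IF}(\mu_w)$ is the influence functional for $\phi(\mu_y^\gamma)$. It may be convenient to 
use the square root of the average squared value of $\mathrm{IF}_p$: 

$$\mathrm{AIF}_p(\mu_w) = (E_{\mu_x} \mathrm{IF}_p^2)^{1/2} = (\mathrm{IF}^T C_x\, \mathrm{IF})^{1/2},$$

where $C_x$ is the $p \times p$ covariance matrix of $\mathbf{x}_t$. It is easy to check that 
$\gamma^2 \mathrm{AIF}_p^2(\mu_w)$ is the squared-bias component of prediction mean-squared-error for 
small $\gamma$: $\sigma^2_{MSE} = \sigma_u^2 + \gamma^2 \mathrm{AIF}_p^2(\mu_w)$. 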
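
As a small numerical illustration of the averaged prediction influence and the resulting mean-squared-error approximation, the following sketch evaluates the quadratic form directly. The function names and all numbers are hypothetical (a made-up influence vector and covariance matrix for an AR(2) fit); nothing here is computed from data in the paper.

```python
import math


def aif_p(if_vec, cov):
    """AIF_p = sqrt(IF^T C_x IF) for an influence vector and a p x p covariance matrix."""
    p = len(if_vec)
    quad = sum(if_vec[i] * cov[i][j] * if_vec[j]
               for i in range(p) for j in range(p))
    return math.sqrt(quad)


def prediction_mse(sigma_u2, gamma, aif):
    """Small-gamma approximation: innovations variance plus squared bias gamma^2 * AIF_p^2."""
    return sigma_u2 + (gamma * aif) ** 2


# Hypothetical AR(2) illustration: these numbers are made up.
if_vec = [1.0, 2.0]
cov = [[2.0, 1.0], [1.0, 2.0]]
aif = aif_p(if_vec, cov)             # sqrt(14)
mse = prediction_mse(1.0, 0.1, aif)  # 1 + 0.01 * 14
```

The quadratic form makes clear that the same parameter influence IF translates into a larger prediction bias when the core process has larger variance or stronger positive dependence between the lagged predictors.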

However, one would be considerably more interested in a measure of influence 
for robust predictors which reflects the effect of outliers in $(\mathbf{y}_t^\gamma)^T = (y_{t-1}^\gamma, \ldots, y_{t-p}^\gamma)$ 
used as predictor variables, as well as the effect of outliers on the parameter 
estimates. Correspondingly, we expect a robust predictor of the core value $x_t$ to 
have the nonlinear asymptotic form 

$$\hat{x}_t = g\bigl(\mathbf{y}_{t-1}^\gamma, \phi(\mu_y^\gamma)\bigr),$$

where $\mathbf{y}_{t-1}^\gamma = (y_{t-1}^\gamma, y_{t-2}^\gamma, \ldots)$. Predictors based on joint robust filter-cleaners and 
AM-type estimation (Martin, 1981; Martin and Yohai, 1985) will have such a 
form. Then one might define 


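$$\mathrm{IF}_p = \frac{\partial}{\partial\gamma}\, E\, g\bigl(\mathbf{y}_{t-1}^\gamma, \phi(\mu_y^\gamma)\bigr)\Big|_{\gamma=0}.$$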


Poor is interested in the possible use of IF’s in connection with long-memory 
processes. Since long-memory processes of the type mentioned by Poor do not 
result in asymptotic bias for most parameter estimates, the IF is not a useful tool 
for assessing the influence of such long-memory processes. For parameter estima- 
tion problems where the rate of convergence can be maintained in the face of 
variance inflation due to long-memory contamination (the notable case where 
this is not so being that of the sample mean), perhaps an analogue of the 
change-of-variance curve CVC (Hampel, Rousseeuw, and Ronchetti, 1981; Hampel 
et al., 1986) would be a useful tool. 


Lack of consistency and bias control. Robinson has made a number of 
interesting comments and suggestions having to do with the issues of asymptotic 
bias and the second-order structure of time-series contamination models. First of 
all we should recall that the spirit of robustness is that of doing well near a 
parametric model (see Huber, 1981; Hampel et al., 1986). In terms of contamination 
models “near” means not too large a fraction $\gamma$ of contamination, but the 
contamination can be arbitrarily bad when it occurs. Obviously quite small $\gamma$ in 
our (2.2) can give rise to $m_y$ and $c_y(j)$’s that are quite far from the $m_x$ and 
$c_x(j)$’s in Robinson (1)-(4). On the other hand, when $\gamma$ is small, the measures $\mu_x$ 
and $\mu_y^\gamma$ will be close in metrics which are suitable for robustness in the time-series 
setting (see for example Boente, Fraiman, and Yohai, 1982). For this reason the 
second-order viewpoint is not too appealing. 

With a view toward asymptotics one can of course put down a richer, more 
accurate class of models, perhaps from a second-order point of view as suggested 
by Robinson, and then estimate everything in sight. However, some caveats are 
in order. In the first place, the fact that we may have some knowledge of what 
type of outliers may occur does not exclude the possibility that other types may 
occur which we do not anticipate, and hence one may find it difficult to specify a 
sufficiently rich model. Furthermore, we have not run into many situations where 
the sample size is sufficient to render estimation of a rich outlier model a 
practically realizable goal. 

On the other hand there do seem to be many applications where the sample 
size is nonetheless sufficiently large that squared bias will be the dominating 
component of mean-squared error. In such situations bias control is a dominant 
robustness consideration. Hampel’s approach of optimally bounding the influence 
curve, pursued by Künsch (1984) in the autoregression context, takes a significant 
step toward obtaining analytic results with regard to bias control. However, one 
must remember that the ICH and IF are infinitesimal in nature, as are optimality 
results based on them. Global robustness results are also highly desirable. 

To date the main focus of global robustness has been on the breakdown point 
(see Hampel, 1971; Huber, 1981; and Hampel et al., 1986, for definitions). Indeed, 
the problem of constructing (and computing) high breakdown point estimates has 
been a lively area of research in recent years (see for example Rousseeuw and 
Yohai, 1984; Hampel et al., 1986; Yohai, 1985; Yohai and Zamar, 1985). High 
breakdown point estimates having high efficiency may well provide the preferred 
approach in areas such as robust regression. 

On the other hand, global bias optimality results have received little attention. 
One approach to global bias optimality is to define an optimal bias robust 
estimate as one which minimizes the maximum asymptotic bias for a given 
fraction of contamination $\gamma$. In spite of Huber’s (1964, 1981) proof that the 
sample median has this property (see also Section 2.7 of Hampel et al., 1986), the 
approach has been essentially neglected. From recent results (Martin and Zamar, 
1985; Zamar, 1985) it appears that min-max bias robust estimates, both with and 
without an efficiency constraint at the Gaussian model, can be obtained in 
situations such as estimating location, scale, and regression parameters with 
independent observations. It is hoped that one can obtain similar analytical bias 
robust solutions in the time-series setting. Again, the issue of how large a class P, 
of contaminating measures one should use will arise. Both relatively narrow and 
quite broad classes should be considered, in correspondence with a practitioner’s 
state of knowledge. 

We concur with Robinson that the identification problem should be taken 
seriously, but our emphasis in this area would be somewhat different than his, as 
reflected in the following comments on robust model selection. 


Robust model selection. Franke and Hannan, Robinson and Tsay all 
raise the issue of the interplay between model fitting and robustness, and 
implicitly this raises the issue of robust model selection. This is a thorny issue 
concerning which there is a notable lack of understanding, even in the ordinary 
regression setting. 

As Robinson has aptly pointed out, an (arbitrarily small) PR- or AO-type 
contamination results in a more complicated model. A pure autoregression 
becomes an ARMA model, and an ARMA model becomes an ARMA model with 
a higher-order moving average component, etc. Similar effects will be caused by 
almost any kind of contamination. The basic point is that in arbitrarily small 
neighborhoods of an ARMA($p_0$, $q_0$) Gaussian model there will be non-Gaussian 
ARMA($p$, $q$) models with $p$, $q$ arbitrarily large, and as Robinson points out, 
the covariances of the ARMA($p$, $q$) model may be quite far from those of the 
ARMA($p_0$, $q_0$) model. 

As a consequence, one cannot, for example, expect any robust procedure to 
asymptotically fit a finite-order autoregression to a contaminated time series $y_t^\gamma$ 
in which $x_t$ is AR($p_0$). However, GM, RA and (probably) AM estimates are 
qualitatively robust in the AR case, i.e., a small fraction of contamination $\gamma$ will 
produce only small biases (a proof for GM estimates is given in Boente, Fraiman, 
and Yohai, 1982). Thus, for such estimates most of the estimated AR coefficients 
will be small, and one expects that a good AR($p$) fit can be made with $p$ close 
to $p_0$. Correspondingly, one expects to obtain a quite reasonable identification of 
the order if a good robust order selection rule is used. 

We propose that a robust model selection rule be constructed in the following 
way. Let $s_n(p, q) = s_n(p, q, \hat{\alpha})$ be a robust measure of scale of the prediction 
errors. Here n denotes sample size, and $\hat{\alpha}$ is a robust estimate of the parameters 
of an ARMA($p$, $q$) model, with $p \leq P_n$, $q \leq Q_n$, and $P_n$, $Q_n$ nondecreasing in n. 
Robustness of both $\hat{\alpha}$ and $s_n$ is needed. Then choose $\hat{p}_n$, $\hat{q}_n$ to minimize 

$$\mathrm{RMOD}_n(p, q) = s_n(p, q)(1 + K_n),$$

where $K_n$ is a penalty term for overfitting. For a specific proposal using pure 
autoregressive fits, see Martin (1981), where a robustified AIC-type statistic was 
proposed and the efficacy of its use illustrated by example. 
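
A rule of this form is easy to sketch once prediction residuals from robust fits are available. The code below is an illustration only: the normalized median absolute deviation is one possible choice of robust scale, the penalty 2(p + q)/n is a hypothetical AIC-like choice of the penalty term (which is left unspecified above), and the residuals are assumed to be supplied by whatever robust fitting procedure the user prefers. None of the names come from Martin (1981).

```python
import statistics


def robust_scale(residuals):
    """Normalized median absolute deviation of the prediction residuals."""
    med = statistics.median(residuals)
    return 1.4826 * statistics.median([abs(r - med) for r in residuals])


def rmod(residuals, p, q):
    """RMOD_n(p, q) = s_n(p, q) * (1 + K_n) with a hypothetical penalty K_n."""
    n = len(residuals)
    penalty = 2.0 * (p + q) / n      # assumed AIC-like form; not specified in the text
    return robust_scale(residuals) * (1.0 + penalty)


def select_order(residuals_by_order):
    """residuals_by_order maps (p, q) to prediction residuals from a robust fit
    of that order; the order minimizing RMOD_n is returned."""
    return min(residuals_by_order,
               key=lambda pq: rmod(residuals_by_order[pq], pq[0], pq[1]))


# Hypothetical residual sets for three candidate pure-AR orders, n = 20.
candidates = {
    (0, 0): [2.0, -2.0] * 10,
    (1, 0): [1.0, -1.0] * 9 + [1.0, 50.0],   # one gross outlier among the residuals
    (2, 0): [0.99, -0.99] * 10,
}
best = select_order(candidates)
```

In this made-up example the gross residual of 50 leaves the scale of the order (1, 0) fit essentially unchanged, and the small gain in scale from the extra parameter of order (2, 0) does not pay for its penalty, so the rule returns (1, 0).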

Let $\mathrm{RMOD}(\mu_y^\gamma, P, Q)$ be the asymptotic value of $\mathrm{RMOD}_n(\hat{p}_n, \hat{q}_n)$ when the 
observations are $y_t^\gamma \sim \mu_y^\gamma$, and $P = \lim P_n$, $Q = \lim Q_n$, P, Q finite or infinite, and 
$p_0 \leq P$, $q_0 \leq Q$, where $\mu_{x,0}$ is for an ARMA($p_0$, $q_0$) model. The basic 
requirement is that $\mathrm{RMOD}(\cdot, P, Q)$ be continuous at $\mu_{x,0}$, and preferably also 
continuous at all $\mu_x$ in a neighborhood of $\mu_{x,0}$ (where it is possible that $p_0 > P$, 
$q_0 > Q$). In fact it is desirable to be somewhat more nonparametric. Simply 
assume that we agree to fit with ARMA($p_n$, $q_n$) models, but that $\mu_{x,0}$ is an 
arbitrary stationary Gaussian measure. The basic robustness property of 
$\mathrm{RMOD}(\cdot, P, Q)$ should still be the same. 

It is also natural to require “Fisher” consistency in the sense that 
$\mathrm{RMOD}(\mu_{x,0}, P, Q) = \sigma_u$, where $\sigma_u^2$ is the innovations variance of the ARMA($p_0$, $q_0$) 
model. One might then try to establish consistency of $\hat{p}_n$, $\hat{q}_n$ at $\mu_{x,0}$. 
(Because small biases are inescapable, unless one wants to be superadaptive, 
one cannot expect consistency of any model selection rule except at 
the nominal model $\mu_{x,0}$.) It would also be desirable to establish optimality 
properties at $\mu_{x,0}$ (see, for example, Shibata, 1980; Härdle, 1985). 


Innovations outliers, intervention analysis, adaptivity, etc. Künsch 
points out that (2.2) is not sufficiently general to include innovations outlier 
models, which is certainly true. However, we regard this as relatively unim- 
portant for the following reasons. First of all, though heavy-tailed symmetric 
innovations distributions will produce outliers (of highly structured form), such 
distributions will not result in asymptotic bias. Even asymmetric innovations 
distributions will not result in asymptotic bias for GM and RA estimates of AR 
models, provided an intercept term is included in the model (this will even be 
true for ARMA models, but the RA and GM estimates will no longer be robust 
without “truncation”—see Bustos and Yohai, 1986). Secondly, innovations out- 
liers are often good in the sense that they are “good” leverage points which result 
in increased precision for estimates of the parameters $\phi_1, \ldots, \phi_p$, as has been 
pointed out in earlier literature. Poor’s system identification problem provides 
an interesting contrary case since the innovations are replaced by measured 
system inputs $u_t$ which may be observed with contamination errors. 

Of course, innovations outliers represent just one of several kinds of deviations 
from a nominal Gaussian model which are often substantially different in 
character from the kinds of contamination-type deviations we have focussed on. 
Level shifts, changes in trend, a variety of other “shaped” changes, and time- 
varying parameters are among the problems West and Miller and Lee are 
concerned with. These kinds of behavior certainly occur with some frequency in 
economic time series, and in other subject-matter areas as well. It is clear that 
the use of intervention analysis/structured dummy variables often gives good 
results in those situations where shaped changes in a time series are attributable 
to known causes. 



Wegman expresses his concern with the possibility that a robust method 
might mask an effect which would be well accommodated by intervention 
analysis. We consider, on the contrary, that robust estimates have two distinct 
and useful roles in conjunction with intervention analysis. The first role occurs in 
those cases where there is enough knowledge to specify an intervention form. In 
such cases one in general still has no assurance that outliers will not cause 
problems. This can be dealt with by adapting AM, GM, or RA estimates to 
intervention models. The second role occurs in those situations where one 
overlooks the possibility of intervention modeling, or where one is rather uncer- 
tain about what intervention shape to use. A robust estimate will produce large 
residuals in locations where an intervention should be applied (the AM 
estimate /robust filter- or smoother-cleaner approach may be preferred in this 
case). These residuals will help suggest the form of the intervention and therefore 
enable its incorporation into the model. If, instead, a nonrobust procedure is 
used, the parameters of the core model can be severely biased in an effort to 
explain the overlooked intervention effects. As a result, examination of the 
prediction residuals may not reveal the need for an intervention. 

Of course the adaptive and Bayesian techniques proposed by West have 
substantial appeal. We would emphasize that for forecasting purposes, one of the 
most crucial needs is for a methodology which can assess whether or not unusual 
behavior near the end of a series is passing in nature, or represents real changes in 
the structure of the process (e.g., are the last few points additive outliers or 
innovations outliers). A Bayesian approach is quite natural and appealing. 
However, one difficulty is clearly paramount even when a user is able to specify 
good priors: There will be relatively few data points with which to estimate the 
unusual new structure, and hence even short term forecasts based on such 
changes may not be very good. One must give an honest assessment of this to the 
user. In general we would both push the Bayesian approach hard, and also force 
the user to carefully evaluate multiple forecast options (including the associated 
models and uncertainty). Perhaps this is the kind of thing West has in 
mind; however, his 1986 references were unavailable to us. 

We do question West’s almost total rejection of stationarity for economic time 
series. This runs against the grain of a considerable amount of experience 
according to which specialized adjustment for nonstationarity and structured 
effects results in a stationary core process, and it is the parameters of this core 
process which determine the confidence intervals for short-term forecasts. Also, 
one must be careful that adaptivity does not become superadaptivity with little 
precision or confidence in the model—the data can certainly be fitted too well 
(see Los, 1985). 

West is correct in saying that omnibus robust methods will in some cir- 
cumstances tend to oversmooth the data, and this is an issue which one certainly 
must pay attention to. It should be noted, however, that a good smoother-cleaner 
(Martin, 1979) handles isolated outliers or short patches nicely (just as does a 
good filter-cleaner, e.g., as in Martin, Samarov, and Vandaele, 1983) while at the 
same time making rapid transitions (not oversmoothing) at level shifts (where a 
filter-cleaner may result in smearing of the shift). Furthermore, we would never 
recommend blind use of an “omnibus” method to the exclusion of other reason- 
able procedures. 

In fact, one needs a variety of methodologies, robust and otherwise, at one’s 
disposal, and the good data analyst follows Tukey’s dictum of multiple analysis. 
Among the methodologies we should have in hand are those which combine 
robustness with other features. For example, in answer to one of West’s ques- 
tions: The extension of robust methods, particularly AM-type estimates (see 
Martin and Yohai, 1985), to cover estimation of the fixed parameters in the 
dynamic/time-varying parameter problem seems quite feasible. Also, we do not 
see any problem in applying the current IF concept to estimation of the fixed 
parameters in dynamic models with time varying parameters. 

West raises some very interesting questions about outliers and binary time- 
series models, about which we have not thought very much (but are stimulated to 
do so very soon). 

Some of the questions raised by Miller and Lee have been covered by our 
preceding discussion, and we shall respond to a few others. It is true that we 
should assume joint stationarity of $(x_t, w_t, z_t^\gamma)$. Note, however, our comments 
above concerning IF’s for nonstationary processes. With regard to assumptions 
on the estimator sequence $\{T_n\}$, consistency is the main requirement. The 
important point with regard to the domain of T is that $T(\mu_y^\gamma)$ be well defined by 
(3.3). The fact that ICH is defined for measures that are not stationary and 
ergodic is quite consistent with the fact that, in general, the directional derivative 
of T in the direction determined by $\delta_y$ is not the same as the derivative IF along 
a stationary arc. 

While it is true that outliers with no assignable causes may deserve to have 
full weight, it is equally true that they may deserve to be downweighted. One 
must distinguish between downweighting in estimating structural parameters 
and downweighting for estimating error variances and for forecasting. There is 
usually little harm in downweighting for estimating structural parameters— 
at most some efficiency is lost. Forecasting is quite another matter, which we 
have commented upon above. With regard to error-variance estimates: it is true 
that a robust residuals scale estimate can result in a considerably smaller 
estimate of variance of future observations than the usual sum-of-squares esti- 
mate. However, how much reliability will one put on the latter type of estimate 
when it is influenced quite heavily by a very small number of observations? 
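
The contrast between the two scale estimates can be illustrated with a small sketch. The residuals below are hypothetical, and the robust scale used here (the median absolute deviation with the Gaussian consistency factor 1.4826) is only one possible choice, not necessarily the estimate used in the paper.

```python
import statistics

def mad_scale(residuals):
    """Median absolute deviation about the median, rescaled by 1.4826 so that
    it estimates the standard deviation under a Gaussian model."""
    med = statistics.median(residuals)
    return 1.4826 * statistics.median([abs(r - med) for r in residuals])

# Hypothetical residuals: 18 well-behaved values plus two gross outliers.
clean = [-1.2, -0.9, -0.7, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0,
         0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9, 1.0, 1.2]
residuals = clean + [12.0, -15.0]

classical = statistics.pstdev(residuals)   # sum-of-squares type estimate
robust = mad_scale(residuals)              # barely moved by the two outliers
robust_clean = mad_scale(clean)            # same estimate without the outliers

print(f"sum-of-squares scale: {classical:.3f}")
print(f"robust scale:         {robust:.3f}")
print(f"robust scale (clean): {robust_clean:.3f}")
```

In this sketch two observations out of twenty account for almost all of the sum-of-squares estimate, which is the situation the question above has in mind.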

With regard to the above issues, the recognition of PR- and AO-type behavior
versus innovations-outlier (IO)-like behavior is relevant. Miller and Lee correctly
note the difficulty of assessing AO structure using conventional methods,
and the poor quality of ARMA(1, 1) least-squares fits in such situations. The
latter point is hardly surprising since the ARMA(1, 1) structure may often be
determined by just a small fraction of observations, and only a very large sample 
size would then result in good estimates via a least-squares/second-order fit. 
Robust estimates on the other hand have at least some useful role in both 
supplying good parameter estimates and checking for IO versus AO structure. 
Some evidence concerning the latter point is provided by Martin and Zeh (1977). 

We would also address an attitude which permeates the Miller and Lee 
viewpoint, and which is held by others, particularly those who concentrate on
analysis of economic time series. Namely, “the analyst knows enough about his
data to provide sufficient model structure to accommodate all conceivable prob- 
lems, including contamination by outliers” (take away a competent statistical 
modeling capability, and this becomes a rather antistatistical attitude). Thus 
robust techniques are not needed, they probably are not to be trusted anyway, 
and they certainly are not completely developed. True, some analysts will do quite
well without robust techniques most of the time. However, our experience is that
many will not do so well much of the time, and this group will often benefit from
the availability of good robust techniques to aid their analysis. Even the first 
group will sometimes benefit from the use of robust procedures by virtue of 
discovering the difficulties in the data more quickly. 

Although the last sentence of Miller and Lee makes a valid point, it also 
reveals a certain myopia concerning the nature of the universe of time series. This 
universe is incredibly large and diverse, and the “other events” which impact 
economic time series would be regarded as highly specialized by a radar engineer 
(who might be concerned with real-time problems) or an oceanographer, for 
example. We can think of many users who would be quite delighted to have 
available robust estimates which promise only good estimates of parameters of 
core processes. 

Finally, we certainly do not believe that robust procedures are a be-all and 
end-all. They are simply an often useful statistical tool which should be on the 
shelf with other standard statistical methods for the user to choose. Intervention 
analysis and other modeling techniques, such as those of West, and robust 
estimation for time series are methodologies which can and should live happily 
next to one another, and as such they will be mutually complementary.


Donoho’s comments. Although not a formal discussant, Dave Donoho has 
raised a number of very interesting questions concerning our paper. We respond 
to some of them here. The first has to do with the relationship between IF and a
“Hampel” influence curve $IC = IC_{\mu_x}(\nu)$ defined as the directional derivative of
$T(\mu)$ at $\mu_x$ in the direction of stationary ergodic measures $\nu$ (rather than in the
direction of nonstationary point masses $\delta_y$ as in (4.1)). With $\mu_\gamma = (1 - \gamma)\mu_x + \gamma\nu$,
we have


$IC_{\mu_x}(\nu) = \lim_{\gamma \to 0} \dfrac{T(\mu_\gamma) - T(\mu_x)}{\gamma}.$


Let A, be the linear functional (defined on the set of signed measures for which 
ICH is integrable) given by 


A, (>) = J ICH(y,,t) dv. 
Then A,(v) = IC,(v), and Theorem 4.1 states that 
F(u, T, {uy }= ide) =A, aed fe 


where 


and A,(p,) = 0. The tangent line to the arc p} at y = 0 is given by 
yy = (L= Y) + Y3. 

Although »* is a signed measure, it is not necessarily a probability measure and 
it need not be bounded. Thus IF does not in general have any heuristic 
interpretation as IC,(v) = A,(v) with » such that p, = (1 — y)», + yr corre- 
sponds to a mixture of processes. Furthermore, we do not see any simple way to 
determine A,(v;*) using the values of A,(v) with » ranging over the class of 
stationary ergodic measures. 

Nonetheless a “long-patch” interpretation suggested by Donoho is correct: If 
v is an ergodic stationary probability measure, and p% * ig the probability 
measure corresponding to a process y * defined by observing patches of length k 
of a contaminating process with measure » a fraction of time y, and observing the 
nominal process x, the rest of the time, then under regularity conditions 


A,(o) = in (r,t, (05). 


Donoho has also suggested that it would be interesting to determine GES’s 
using the largest natural class P of measures in (6.1), which would be the class of 
all possible arcs {7} corresponding to processes defined by (2.1) and (2.2). The 
corresponding “least-favorable’ measure would appear to be of considerable 
interest. It follows from our comments above that the GES in question is given 
by 

GES= sup |A, (>?) 
wy EP* 

where P* is the family of all stationary ergodic signed measures $\nu^*$ as specified
above. Unfortunately, it appears at the moment to be difficult to compute the
least favorable measure $\nu^*$. It seems likely that the least favorable arc $\{\mu_y^{\gamma}\}$ will
correspond to a process $y_i^{\gamma}$ with $z_i^{\gamma}$ depending on the nominal process $x_i$, the
reason being that GES should be attained by placing outliers where they will be
most harmful and this would depend on $x_i$. Thus the least favorable arc may
correspond to a rather complicated process. Perhaps some real effort here will 
nonetheless pay off. 

One other question raised by Donoho was “What does a fixed finite patch
length, say k = 20, mean, when the sample size goes to infinity?” Let’s focus
briefly on GM estimates of order p autoregressions for simplicity, where the
estimate has a “span” of p + 1. Then the length k of the patch relative to the
span p + 1 determines the proportion of end effects of the patch relative to
the estimate (an end effect occurs when the patch does not cover the entire span),
and thus k should clearly affect IF.

In the general case one also expects IF to depend upon k, and furthermore
patch length effects have their own asymptotics (see Theorem 5.2(ii) and
Corollary 5.3 of Martin and Yohai, 1984b): patch length asymptotics have set in
when k is such that

$\dfrac{1}{k}\sum_{i=1}^{k} \psi(w_i, t_0) \approx E\psi(w_1, t_0).$


This depends on: (i) How fast }(w},xo,t)) is approximated by (w,,t,), and 


(ii) how fast the ergodic theorem holds for $k^{-1}\sum_{i=1}^{k}\psi(w_i, t_0)$. Factor (i) depends
upon the patch length and effective span of $\psi$.


Vote of thanks. Borrowing on a nice tradition of the Royal Statistical 
Society, we offer our vote of thanks to the discussants. 


Acknowledgment. The authors wish to thank Adrian Raftery for some 
useful comments made during the preparation of this rejoinder. 


REFERENCES 


BASAWA, I. V., HUGGINS, R. M. and STAUDTE, R. C. (1985). Robust tests for time series with an
application to first-order autoregressions. Biometrika 72 559-572.

BOENTE, G., FRAIMAN, R. and YOHAI, V. J. (1982). Qualitative robustness for general stochastic
processes. Technical Report No. 26, Dept. Statist., Univ. of Washington, Seattle.

BOX, G. E. P. and TIAO, G. C. (1975). Intervention analysis with applications to economic and
environmental problems. J. Amer. Statist. Assoc. 70 70-79.

BRUCE, A. and MARTIN, R. D. (1986). Leave-k-out diagnostics for time series. Technical Report,
Dept. Statist., Univ. of Washington, Seattle, in preparation.

HAMPEL, F. R. (1971). A general qualitative definition of robustness. Ann. Math. Statist. 42
1887-1896.

HAMPEL, F. R., ROUSSEEUW, P. J. and RONCHETTI, E. (1981). The change-of-variance curve and
optimal redescending M-estimators. J. Amer. Statist. Assoc. 76 643-648.

HAMPEL, F. R., RONCHETTI, E. M., ROUSSEEUW, P. J. and STAHEL, W. A. (1986). Robust Statistics:
The Approach Based on Influence Functions. Wiley, New York.

HÄRDLE, W. (1985). An efficient selection of regression variables when error distribution is incorrectly
specified. Mimeo Series No. 1582, Dept. Statist., Univ. of North Carolina.

LOS, C. A. (1985). Discussion contribution to paper by West, Harrison and Migon. J. Amer. Statist.
Assoc. 80 92-93.

MARTIN, R. D. (1979). Approximate conditional-mean type smoothers and interpolators. In Smoothing
Techniques for Curve Estimation (T. Gasser and M. Rosenblatt, eds.) 117-143.
Springer, Berlin.

MARTIN, R. D., SAMAROV, A. and VANDAELE, W. (1983). Robust methods for ARIMA models. In
Applied Time Series Analysis of Economic Data (A. Zellner, ed.) 163-169. Economic
Research Report ER-5, Bureau of the Census, Washington.

MARTIN, R. D. and ZAMAR, R. (1985). Efficient min-max bias M-estimates of location and scale.
Technical Report No. 72, Dept. Statist., Univ. of Washington, Seattle.

MARTIN, R. D. and ZEH, J. E. (1977). Determining the character of time series outliers. Proc. ASA
Bus. Econ. Statist. Sec. 818-823. Amer. Statist. Assoc., Washington.

ROUSSEEUW, P. and YOHAI, V. (1984). Robust regression by means of S-estimators. In Robust and
Nonlinear Time Series Analysis (J. Franke, W. Härdle and D. Martin, eds.) 256-272.
Springer, New York.

SHIBATA, R. (1980). Asymptotically efficient selection of the order of the model for estimating
parameters of a linear process. Ann. Statist. 8 147-164.

YOHAI, V. J. (1985). High breakdown point and high efficiency robust estimates for regression.
Technical Report No. 66, Dept. Statist., Univ. of Washington, Seattle.

YOHAI, V. J. and ZAMAR, R. H. (1985). High breakdown point estimates of regression by means of
minimization of an efficient scale. Technical Report No. 81, Dept. Statist., Univ. of
Washington, Seattle.

ZAMAR, R. H. (1985). Robust estimation for the errors in variables model. Ph.D. dissertation, Dept.
Statist., Univ. of Washington, Seattle.


DEPARTMENT OF STATISTICS, GN-22
UNIVERSITY OF WASHINGTON
SEATTLE, WASHINGTON 98195

DEPARTMENT OF MATHEMATICS
UNIVERSITY OF BUENOS AIRES
CIUDAD UNIVERSITARIA, PABELLON 1
1428 BUENOS AIRES
ARGENTINA


The Annals of Statistics 
1986, Vol. 14, No. 3, 856-873 


LIKELIHOOD AND OBSERVED GEOMETRIES 


By O. E. BARNDORFF-NIELSEN 
Aarhus University 


In the differential geometric approach to parametric statistics, developed
by Chentsov, Efron, Amari, and others, the parameter space is set up as a
differentiable manifold with expected information as metric tensor and with a
family of affine connections, the α-connections, determined from the expected
information and the skewness tensor of the score vector. The usefulness of
this approach is particularly notable in connection with Edgeworth expansions
of estimators. Motivated by the conditionality viewpoint, an “observed”
parallel to that theory is established in the present paper using observed
information and an “observed skewness” tensor instead of the above expected
quantities. The formula $c|\hat{j}|^{1/2}\bar{L}$ for the conditional distribution of the
maximum likelihood estimator is expanded (to third order) asymptotically
and the “observed geometries” are shown to have a role in this type of
expansion similar to that of the “expected geometries” in the Edgeworth
expansions mentioned above. In these new developments “mixed derivatives
of the log model function,” defined by means of an auxiliary statistic complementing
the maximum likelihood estimator, take the place of moments of
derivatives of the log likelihood function.


1. Introduction. A number of recent investigations have shown that in the
study of inference for parametric statistical models, particularly as regards
higher-order asymptotics, it is useful and illuminating to set the model, $\mathcal{M}$ say,
up as a differentiable manifold equipped with a Riemannian metric and a family
of affine connections, the so-called α-connections. In that approach, the parameter
space of the model serves for the coordinate representation of $\mathcal{M}$, the metric
tensor employed is the expected information matrix


(1.1)    $i_{rs} = -E\{\partial_r \partial_s l\}$


and the family of α-connections is determined by (1.1) and the so-called skewness
tensor


(1.2)    $T_{rst} = E\{\partial_r l \, \partial_s l \, \partial_t l\}$


which is a covariant tensor of rank 3. Here $l$ denotes the log likelihood function of
the model, and with $\omega$, of dimension $d$, as the parameter of the model, we write
$\omega = (\omega^1, \ldots, \omega^d)$ and $\partial_r = \partial/\partial\omega^r$. The indices $r, s, t, \ldots$ run over $1, 2, \ldots, d$. In
this framework it is, for instance, possible to give geometrical interpretations to
various of the terms arising in conditional and unconditional Edgeworth expansions
for the distribution of the maximum likelihood estimator $\hat\omega$ under curved
exponential models. For these theoretical developments see Efron (1975), Amari
(1982a, b, 1983, 1984, 1985), and Amari and Kumon (1983) and the references
given there.


Received January 1985; revised September 1985.

AMS 1980 subject classifications. Primary 62F99; secondary 62A10.

Key words and phrases. Alpha connections, ancillarity, asymptotic expansion, Bartlett factors,
conditionality, constant normal fractile, covariant differentiation, exponential model, hyperboloid
model, inverse Gaussian-Gaussian model, location-scale model, maximum likelihood, observed
information, tensors.

For many purposes the observed information matrix $j$, i.e.,


$j_{rs} = -\partial_r \partial_s l,$


is more natural to work with than the expected information $i$, and it therefore
seemed of interest to enquire whether the model $\mathcal{M}$ can also be rigged with some
kind of “observed geometrical structures” paralleling the “expected geometrical
structures” given by $(i, T)$ as defined by (1.1) and (1.2). We shall show that this is
indeed the case and that the resulting geometries are intimately connected with
a certain type of asymptotic expansion deriving from the formula $c|\hat{j}|^{1/2}\bar{L}$
[Barndorff-Nielsen (1980, 1983)] for the conditional distribution of the maximum
likelihood estimator.

These new types of statistical geometries and expansions notably do not 
involve integrations over the sample space, as is required in (1.1) and (1.2) and in 
the calculation of the cumulants that occur in the Edgeworth expansions. Instead 
they employ what may be referred to as mixed derivatives of the log model 
function. 

Furthermore, whereas the studies of expected geometries have been largely 
concerned with curved exponential families, the approach taken here makes it 
equally natural to consider other parametric models, and in particular transfor- 
mation models. 

The viewpoint of conditional inference has been instrumental for the construc- 
tions in question. However, the observed geometrical calculus, as discussed below, 
does not presuppose the existence of exact or approximate ancillaries but only 
operates with an auxiliary statistic $a$ complementing the maximum likelihood
estimate $\hat\omega$. Only when it comes to applications to problems of inference does
distribution constancy—and hence ancillarity—of a become essential. 

Let the model $\mathcal{M}$ be given by $(\mathcal{X}, p(x; \omega), \Omega)$ where $\mathcal{X}$ is the sample space, $\Omega$
is the parameter space, and $p(x; \omega)$ is the model function, i.e., for a given value
of the parameter $\omega$ the function $p(x; \omega)$ is the probability density function of the
observation $x \in \mathcal{X}$ relative to a fixed dominating measure $\mu$ on $\mathcal{X}$. Suppose the
minimal sufficient statistic $t$ for $\mathcal{M}$ is of dimension $k$. We then speak of $\mathcal{M}$ as a
$(k, d)$-model ($d$ being the dimension of the parameter $\omega$). Let $(\hat\omega, a)$ be a
one-to-one transformation of $t$, where $\hat\omega$ is the maximum likelihood estimator of
$\omega$ and $a$, of dimension $k - d$, is an auxiliary statistic.

In most applications it will be essential to construct $a$ so as to be distribution
constant either exactly or to the relevant asymptotic order. And then, according
to the conditionality principle, the conditional model for $\hat\omega$ given $a$ is considered
the appropriate basis for inference on $\omega$.

However, distribution constancy of a is not assumed in the construction of the 
observed geometries. 

There will be no loss of generality in viewing the log likelihood $l = l(\omega)$ in its
dependence on the observation $x$ as being a function of the minimal sufficient
$(\hat\omega, a)$ only. Henceforth we shall think of $l$ in this manner and we will indicate
this by writing $l = l(\omega; \hat\omega, a)$. Similarly, in the case of observed information we
write $j = j(\omega; \hat\omega, a)$, etc. We may now take partial derivatives of $l$ with respect
to the coordinates $\hat\omega^r$ of $\hat\omega$ as well as with respect to $\omega^r$. Letting $\slashed{\partial}_r = \partial/\partial\hat\omega^r$ we
introduce the notation


(1.3)    $l_{r_1 \cdots r_p;\, s_1 \cdots s_q} = \partial_{r_1} \cdots \partial_{r_p} \slashed{\partial}_{s_1} \cdots \slashed{\partial}_{s_q} l$

and refer to these quantities as mixed derivatives of the log model function. The
function of $\omega$ and $a$ obtained from (1.3) by substituting $\omega$ for $\hat\omega$ will be denoted
by $\slashed{l}_{r_1 \cdots r_p;\, s_1 \cdots s_q}$. Thus, for instance,


$\slashed{l}_{rs;t} = \slashed{l}_{rs;t}(\omega) = \slashed{l}_{rs;t}(\omega; a) = l_{rs;t}(\omega; \omega, a).$

Similarly,

$\slashed{j} = \slashed{j}(\omega) = \slashed{j}(\omega; a) = j(\omega; \omega, a).$

The observed geometries, which will be introduced and illustrated in Section 2,
are expressed in terms of the mixed derivatives

(1.4)    $\slashed{l}_{r_1 \cdots r_p;\, s_1 \cdots s_q}.$

So are the terms of an asymptotic expansion of

(1.5)    $p^*(\hat\omega; \omega \mid a) = c|\hat{j}|^{1/2}\bar{L},$

to be derived in Section 3. In (1.5) $\bar{L}$ denotes the normed likelihood function, i.e.,

$\bar{L} = e^{l - \hat{l}},$

$|\hat{j}|$ is the determinant of the observed information, and $c = c(\omega, a)$ is a norming
constant determined so as to make the integral of $p^*(\hat\omega; \omega \mid a)$ with respect to $\hat\omega$
for fixed $a$ equal to 1.

For $a$ ancillary the model function $p^*$, given by (1.5), may be considered as an
approximation to the actual model function $p(\hat\omega; \omega \mid a)$ for the maximum likelihood
estimator $\hat\omega$ conditional on $a$. As such it is, in wide generality, correct to
order $O(n^{-3/2})$ at least, under repeated sampling with $n$ denoting sample size. In
fact, $p^*(\hat\omega; \omega \mid a)$ equals $p(\hat\omega; \omega \mid a)$ exactly for a considerable range of models,
including all transformation models, cf. Barndorff-Nielsen (1980, 1983, 1984b) and
Barndorff-Nielsen and Blæsild (1984). Some further discussion and applications
of (1.5) may be found in Barndorff-Nielsen and Cox (1984a, b), Barndorff-Nielsen
(1984a, 1985a, b) and McCullagh (1984a). In particular, in Barndorff-Nielsen and
Cox (1984a) a simple relation is established between the norming constant $c$ of
(1.5) and the Bartlett adjustment factors for log likelihood ratio tests of hypotheses
about $\omega$. [See also Barndorff-Nielsen and Cox (1984b).] We comment on this
relation in Section 3.
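
As a rough numerical illustration of (1.5), consider a case not treated in the paper: n independent exponential observations with rate θ, for which the maximum likelihood estimate is the reciprocal of the sample mean and no auxiliary statistic is needed. This scale model is a transformation model, so one would expect the renormalised formula to reproduce the exact density of the estimate. The sketch below checks this by numerical integration; the values n = 5 and θ = 2, the integration range and the function names are illustrative assumptions.

```python
import math

def p_star_unnormalised(theta_hat, theta, n):
    """|j_hat|^{1/2} * exp(l - l_hat) for n exponential observations with
    rate theta, where theta_hat = 1 / (sample mean)."""
    log_ratio = n * (math.log(theta / theta_hat) - theta / theta_hat + 1.0)
    j_hat = n / theta_hat ** 2          # observed information at theta_hat
    return math.sqrt(j_hat) * math.exp(log_ratio)

def exact_density(theta_hat, theta, n):
    """Exact density of theta_hat = n / S, where S ~ Gamma(n, rate theta)."""
    log_f = (n * math.log(n * theta) - (n + 1) * math.log(theta_hat)
             - n * theta / theta_hat - math.lgamma(n))
    return math.exp(log_f)

def integrate(f, lo, hi, steps):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

n, theta = 5, 2.0
# Norming constant c of (1.5); the mass beyond 400 is negligible for n = 5.
mass = integrate(lambda th: p_star_unnormalised(th, theta, n), 0.0, 400.0, 400_000)
c = 1.0 / mass
# Closed form of c for this model, close to the limiting value (2*pi)^(-1/2).
c_exact = n ** (n - 0.5) * math.exp(-n) / math.gamma(n)

print(c, c_exact, (2.0 * math.pi) ** -0.5)
print(c * p_star_unnormalised(1.7, theta, n), exact_density(1.7, theta, n))
```

Under these assumptions the norming constant differs from $(2\pi)^{-1/2}$ only through Stirling's approximation to $\Gamma(n)$, and after renormalisation the two densities agree to integration accuracy.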

Besides being in a certain sense “closer to the actual data at hand,” the 
“observed” quantities and formulas are in various respects simpler to work with 
than their expected counterparts. For instance, in certain cases Bartlett adjust- 
ment factors are more readily calculable in terms of the observed quantities. 
Another example is provided by formula (3.15), cf. the discussion following that 
formula. 


Some connections between expected and observed geometries and profile 
likelihood, L-sufficiency, marginal likelihood, and transformation models have 
been studied in Barndorff-Nielsen and Jupp (1984, 1985). 


2. Observed geometries. We shall be interested in how various quantities
behave under reparametrizations of the model $\mathcal{M}$. Let $\psi$, of dimension $d$, be the
parameter of some parametrization of $\mathcal{M}$, alternative to that indicated by $\omega$.
Coordinates of $\psi$ will be denoted by $\psi^a$, $\psi^b$, etc. and we write $\partial_a$ for $\partial/\partial\psi^a$ and
$\omega^r_{/a}$ for $\partial\omega^r/\partial\psi^a$, $\omega^r_{/ab}$ for $\partial^2\omega^r/\partial\psi^a \partial\psi^b$, etc. Furthermore, we write $l(\psi)$ for the
log likelihood under the parametrization by $\psi$, though formally this is in conflict
with the notation $l(\omega)$, and correspondingly we let $l_a = \partial_a l = \partial_a l(\psi)$, etc.;
similarly for other parameter-dependent quantities. Finally, the symbol ^ over
such a quantity indicates that the maximum likelihood estimate has been
substituted for the parameter.

Using this notation and that established in Section 1, and adopting the
summation convention that if a suffix occurs repeatedly in a single expression
then summation over that suffix is understood, we have


(2.1)    $l_a = l_r \omega^r_{/a},$
(2.2)    $l_{ab} = l_{rs}\omega^r_{/a}\omega^s_{/b} + l_r\omega^r_{/ab},$
(2.3)    $l_{abc} = l_{rst}\omega^r_{/a}\omega^s_{/b}\omega^t_{/c} + l_{rs}\omega^r_{/ab}\omega^s_{/c}[3] + l_r\omega^r_{/abc},$


etc., where [3] signifies a sum of three similar terms determined by permutation
of the indices $a$, $b$, $c$. On substituting $\hat\omega$ for $\omega$ in (2.2) we obtain the well-known
relation

$\hat{j}_{ab} = \hat{j}_{rs}\hat\omega^r_{/a}\hat\omega^s_{/b}$

which, now by substitution of $\omega$ for $\hat\omega$, may be reexpressed as

(2.4)    $\slashed{j}_{ab} = \slashed{j}_{rs}\omega^r_{/a}\omega^s_{/b}$

or, written more explicitly,

$\slashed{j}_{ab}(\psi; a) = \slashed{j}_{rs}(\omega; a)\dfrac{\partial\omega^r}{\partial\psi^a}\dfrac{\partial\omega^s}{\partial\psi^b}.$


Equation (2.4) shows that $\slashed{j}$ is a metric tensor on $\mathcal{M}$, for any given value of the
auxiliary statistic $a$. Moreover, in wide generality $\slashed{j}$ will be positive definite on
$\mathcal{M}$, and we assume henceforth that this is the case. In fact, for any $\hat\omega \in \Omega$ we
have $\slashed{j}(\hat\omega) = \hat{j}$, i.e., observed information at the maximum likelihood point, which is
generally positive definite (though counterexamples do exist).

Equipped with $\slashed{j}$ as metric tensor, $\mathcal{M}$ becomes a Riemannian manifold. Notice
however that this Riemannian geometry depends on the value of the auxiliary $a$.
We call $\slashed{j}$ the observed metric on $\mathcal{M}$.


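The Riemannian connection determined by $\slashed{j}$ has connection symbols $\overset{0}{\slashed{\Gamma}}{}^t_{rs}$
given by $\overset{0}{\slashed{\Gamma}}{}^t_{rs} = \slashed{j}^{tu}\overset{0}{\slashed{\Gamma}}_{rsu}$ and


$\overset{0}{\slashed{\Gamma}}_{rst} = \tfrac{1}{2}(\partial_r \slashed{j}_{st} - \partial_t \slashed{j}_{rs} + \partial_s \slashed{j}_{tr}).$


Employing the notation established in Section 1 we have $\partial_t \slashed{j}_{rs} = -\partial_t \slashed{l}_{rs} =
-(\slashed{l}_{rst} + \slashed{l}_{rs;t})$, etc., so that


(2.5)    $\overset{0}{\slashed{\Gamma}}_{rst} = \slashed{l}_{rs;t} - \tfrac{1}{2}(\slashed{l}_{rst} + \slashed{l}_{rs;t}[3]).$

As we shall now show, the quantity

(2.6)    $\slashed{T}_{rst} = -(\slashed{l}_{rst} + \slashed{l}_{rs;t}[3])$

is a covariant tensor of rank 3, i.e.,

(2.7)    $\slashed{T}_{abc} = \slashed{T}_{rst}\omega^r_{/a}\omega^s_{/b}\omega^t_{/c}.$

First, from (2.3) we have

(2.8)    $\slashed{l}_{abc} = \slashed{l}_{rst}\omega^r_{/a}\omega^s_{/b}\omega^t_{/c} + \slashed{l}_{rs}\omega^r_{/ab}\omega^s_{/c}[3].$


Further, from (2.2) we obtain, on differentiating with respect to $\hat\psi^c$ and then
substituting parameter for estimate,


(2.9)    $\slashed{l}_{ab;c} = \slashed{l}_{rs;t}\omega^r_{/a}\omega^s_{/b}\omega^t_{/c} + \slashed{l}_{r;s}\omega^r_{/ab}\omega^s_{/c}.$

Finally, differentiating the likelihood equation

$\slashed{l}_r = 0$

we find

(2.10)    $\slashed{l}_{rs} + \slashed{l}_{r;s} = 0,$

or

(2.11)    $\slashed{l}_{r;s} = \slashed{j}_{rs}.$


Combination of (2.6), (2.8), (2.9), and (2.11) yields (2.7).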
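It follows from the tensorial nature of $\slashed{T}$ and from (2.5) and (2.11) that for any
real α an affine connection $\overset{\alpha}{\slashed{\Gamma}}$ on $\mathcal{M}$ may be defined by


$\overset{\alpha}{\slashed{\Gamma}}{}^t_{rs} = \slashed{j}^{tu}\overset{\alpha}{\slashed{\Gamma}}_{rsu}$

with

(2.12)    $\overset{\alpha}{\slashed{\Gamma}}_{rst} = \slashed{l}_{rs;t} + \dfrac{1 - \alpha}{2}\slashed{T}_{rst}.$


In particular, we have

$\overset{1}{\slashed{\Gamma}}_{rst} = \slashed{l}_{rs;t}, \qquad \overset{-1}{\slashed{\Gamma}}_{rst} = \slashed{l}_{t;rs},$

where to obtain the latter expression we have used

$\slashed{l}_{rst} + \slashed{l}_{rs;t} + \slashed{l}_{rt;s} + \slashed{l}_{r;st} = 0$

which follows on differentiation of (2.10). It may also be noted that

$\partial_t \slashed{j}_{rs} = \overset{1}{\slashed{\Gamma}}_{trs} + \overset{-1}{\slashed{\Gamma}}_{tsr} = \overset{1}{\slashed{\Gamma}}_{tsr} + \overset{-1}{\slashed{\Gamma}}_{trs}$

and

$\overset{\alpha}{\slashed{\Gamma}}_{rst} = \dfrac{1 + \alpha}{2}\overset{1}{\slashed{\Gamma}}_{rst} + \dfrac{1 - \alpha}{2}\overset{-1}{\slashed{\Gamma}}_{rst}.$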


The connections $\overset{\alpha}{\slashed{\Gamma}}$, which we shall refer to as the observed α-connections, are
analogues of the expected α-connections $\overset{\alpha}{\Gamma}$ of Chentsov (1972) and Amari (1982a),
which are given by

$\overset{\alpha}{\Gamma}{}^t_{rs} = i^{tu}\overset{\alpha}{\Gamma}_{rsu}$

and

$\overset{\alpha}{\Gamma}_{rst} = E\{l_{rs}l_t\} + \dfrac{1 - \alpha}{2}T_{rst}$

where $T$ is the skewness tensor (1.2). The analogy between $T$ and $\slashed{T}$ becomes
more apparent by rewriting $T_{rst}$ as

$T_{rst} = -E\{l_{rst} + l_{rs}l_t[3]\},$

the validity of which follows on differentiation of the formula

(2.13)    $E\{l_{rs} + l_r l_s\} = 0,$


which, in turn, may be compared to (2.10).

Under the specifications of a of primary statistical interest, one has that in 
broad generality the observed geometries converge to the corresponding expected 
geometries as the sample size tends to infinity. 

For (k, k) exponential models


(2.14)    $p(x; \theta) = a(\theta)b(x)e^{\theta \cdot t},$


we have $\slashed{j} = i$ and $\overset{\alpha}{\slashed{\Gamma}} = \overset{\alpha}{\Gamma}$, $\alpha \in R$. More generally, for a curved subfamily of
(2.14), given by restricting $\theta$ to be of the form $\theta = \theta(\omega)$ where the dimension $d$ of
the parameter $\omega$ is less than $k$, the quantities $\slashed{j}$ and $\slashed{T}$ possess, under mild
regularity conditions, asymptotic expansions the first terms of which are given by


(2.15) he = t = Oirbh a^ + SS 
and 


(2.16) Trot a Trst T { 549560788, [3] 


+ ,,9/7/13] i PIIRA PeT 


Here suffixes $i$ and $j$ run from 1 to $k$, $\theta^i$, $\theta^j$ denote coordinates of $\theta$, $\kappa_{ij} = \partial_i \partial_j \kappa$,
where $\kappa = \kappa(\theta) = -\log a(\theta)$ is the cumulant transform of $t$, and $\partial_i = \partial/\partial\theta^i$, and
$a^\lambda$, $\lambda = 1, \ldots, k - d$, are coordinates of an ancillary complement of $\hat\omega$. For
instance, in the repeated sampling situation and letting $a_0$ denote the affine
ancillary, as defined in Barndorff-Nielsen (1980), we may take $a = n^{-1/2}a_0$ and
the expansions (2.15) and (2.16) are asymptotic in powers of $n^{-1/2}$. [For further
comparison with Amari (1982a) it may be noted that the coefficient in the
e 
first-order correction term of (2.15) may be written as 8; bjx, =nH a 
e 
where H,,, is Amari’s notation for the exponential curvature, or a-curvature 


with a = 1, of the curved exponential model viewed as a manifold imbedded in 
the full (k, k) model.] 

We now briefly consider four examples. In the first three the model is 
transformational and the auxiliary statistic a is taken to be the maximal 
invariant statistic, and thus a is exactly ancillary. In the fourth example a is 
only approximately ancillary. Examples 2.1, 2.3, and 2.4 are concerned with 
curved exponential models whereas the model in Example 2.2—the location-scale 
model—is exponential only if the error distribution is normal. 


EXAMPLE 2.1. Constant normal fractile. For known $\alpha \in (0, 1)$ and $c \in
(-\infty, \infty)$, let $\mathcal{N}_{\alpha,c}$ denote the class of normal distributions having the real
number $c$ as α-fractile, i.e.,

$\mathcal{N}_{\alpha,c} = \{N(\mu, \sigma^2): (c - \mu)/\sigma = u_\alpha\},$

where $u_\alpha$ denotes the α-fractile of the standard normal distribution, and let
$x_1, \ldots, x_n$ be a sample from a distribution in $\mathcal{N}_{\alpha,c}$. The model for $x = (x_1, \ldots, x_n)$
thus defined is a (2,1) exponential model, except for $u_\alpha = 0$ when it is a (1,1)
model. Henceforth we suppose that $u_\alpha \neq 0$, i.e., $\alpha \neq \tfrac{1}{2}$. The model is also a
transformation model relative to the subgroup $G$ of the group of one-dimensional
affine transformations given by


$G = \{[c(1 - \lambda), \lambda]: \lambda > 0\},$

the group operation being

$[c(1 - \lambda), \lambda][c(1 - \lambda'), \lambda'] = [c(1 - \lambda\lambda'), \lambda\lambda']$

and the action of $G$ on the sample space being

$[c(1 - \lambda), \lambda](x_1, \ldots, x_n) = (c(1 - \lambda) + \lambda x_1, \ldots, c(1 - \lambda) + \lambda x_n).$


(Note that $G$ is isomorphic to the multiplicative group.)
Letting


$a = (\bar{x} - c)/s',$


where $\bar{x} = (x_1 + \cdots + x_n)/n$ and

$s'^2 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2,$


we have that $a$ is maximal invariant and, parametrizing the model by $\zeta = \log\sigma$,
that the maximum likelihood estimate is


$\hat\zeta = \log(bs'),$

where

$b = b(a) = (u_\alpha/2)a + \sqrt{1 + \{(u_\alpha/2)^2 + 1\}a^2}.$


Furthermore, $(\hat\zeta, a)$ is a one-to-one transformation of the minimal sufficient
statistic $(\bar{x}, s')$ and $a$ is exactly ancillary.
The log likelihood function may be written as

$l(\zeta) = l(\zeta; \hat\zeta, a) = n[\hat\zeta - \zeta - \tfrac{1}{2}\{b^{-2}e^{2(\hat\zeta - \zeta)} + (u_\alpha + ab^{-1}e^{\hat\zeta - \zeta})^2\}],$


from which it is evident that the model for $\hat\zeta$ given $a$ is a location model.
Indicating differentiation with respect to $\zeta$ and $\hat\zeta$ by subscripts $\zeta$ and $\hat\zeta$
respectively, we find


$l_\zeta = n\{-1 + b^{-2}e^{2(\hat\zeta - \zeta)} + ab^{-1}(u_\alpha + ab^{-1}e^{\hat\zeta - \zeta})e^{\hat\zeta - \zeta}\},$

and hence

$\slashed{j} = n\{2b^{-2} + ab^{-1}(u_\alpha + 2ab^{-1})\},$

$\slashed{l}_{\zeta\zeta\zeta} = n\{4b^{-2} + ab^{-1}(u_\alpha + 4ab^{-1})\},$

$\slashed{l}_{\zeta\zeta;\hat\zeta} = -n\{4b^{-2} + ab^{-1}(u_\alpha + 4ab^{-1})\} = \overset{1}{\slashed{\Gamma}},$

$\slashed{l}_{\zeta;\hat\zeta\hat\zeta} = n\{4b^{-2} + ab^{-1}(u_\alpha + 4ab^{-1})\} = \overset{-1}{\slashed{\Gamma}} = -\overset{1}{\slashed{\Gamma}},$

and the observed skewness tensor is

$\slashed{T} = n\{8b^{-2} + 2ab^{-1}(u_\alpha + 4ab^{-1})\}.$
Note also that 


oe 


We mention in passing that another normal submodel, that specified by a
known coefficient of variation $\mu/\sigma$, has properties similar to those exhibited by
Example 2.1.
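
The closed forms of Example 2.1 can be checked numerically. The following is a minimal sketch, assuming the log likelihood in the form $l(\zeta; \hat\zeta, a) = n[\hat\zeta - \zeta - \tfrac12\{b^{-2}e^{2(\hat\zeta-\zeta)} + (u_\alpha + ab^{-1}e^{\hat\zeta-\zeta})^2\}]$ and using central finite differences; the numerical values ($\alpha = 0.9$, $a = 0.3$, $n = 10$, $\zeta = 0.2$), the step size and all function names are illustrative assumptions.

```python
import math
from statistics import NormalDist

def b_of_a(a, u):
    """b(a) = (u/2) a + sqrt(1 + ((u/2)^2 + 1) a^2)."""
    return 0.5 * u * a + math.sqrt(1.0 + ((0.5 * u) ** 2 + 1.0) * a * a)

def loglik(zeta, zeta_hat, a, u, n):
    """l(zeta; zeta_hat, a) for the constant-fractile model, zeta = log(sigma)."""
    r = math.exp(zeta_hat - zeta) / b_of_a(a, u)
    return n * (zeta_hat - zeta - 0.5 * (r * r + (u + a * r) ** 2))

def closed_forms(a, u, n):
    """Observed information, l_{zeta zeta zeta} and observed skewness,
    all evaluated with the estimate set equal to the parameter."""
    b = b_of_a(a, u)
    j = n * (2.0 / b ** 2 + (a / b) * (u + 2.0 * a / b))
    third = n * (4.0 / b ** 2 + (a / b) * (u + 4.0 * a / b))
    return j, third, 2.0 * third

# Illustrative values (assumptions, not taken from the paper).
u = NormalDist().inv_cdf(0.9)        # u_alpha for alpha = 0.9
a, n, z, h = 0.3, 10, 0.2, 1e-2

def d2_zeta(zeta, zeta_hat):
    """Second central difference of l with respect to zeta."""
    f = lambda x: loglik(x, zeta_hat, a, u, n)
    return (f(zeta + h) - 2.0 * f(zeta) + f(zeta - h)) / h ** 2

score = (loglik(z + h, z, a, u, n) - loglik(z - h, z, a, u, n)) / (2.0 * h)
j_fd = -d2_zeta(z, z)                                             # observed information
l_zzz_fd = (d2_zeta(z + h, z) - d2_zeta(z - h, z)) / (2.0 * h)    # l_{zeta zeta zeta}
l_zz_hat_fd = (d2_zeta(z, z + h) - d2_zeta(z, z - h)) / (2.0 * h) # l_{zeta zeta; zeta_hat}
skew_fd = -(l_zzz_fd + 3.0 * l_zz_hat_fd)                         # definition (2.6), d = 1

j, third, skew = closed_forms(a, u, n)
print(score, j_fd, j, skew_fd, skew)
```

The vanishing score at $\hat\zeta = \zeta$ confirms that $b(a)$ solves the likelihood equation, and the finite-difference skewness agrees with the closed form up to discretisation error.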


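EXAMPLE 2.2. Location-scale model. Let data $x$ consist of a sample $x_1, \ldots, x_n$
from a location-scale model, i.e., the model function is


$p(x; \mu, \sigma) = \sigma^{-n}\prod_{i=1}^{n} f\!\left(\dfrac{x_i - \mu}{\sigma}\right)$


for some known probability density function $f$. We assume that $\{x: f(x) > 0\}$ is
an open interval and that $g = -\log f$ has a positive and continuous second-order
derivative on that interval. This ensures that the maximum likelihood estimate
$(\hat\mu, \hat\sigma)$ exists uniquely with probability 1 [cf., for instance, Burridge (1981)].
Taking as the auxiliary $a$ Fisher’s configuration statistic

$a = (a_1, \ldots, a_n) = \left(\dfrac{x_1 - \hat\mu}{\hat\sigma}, \ldots, \dfrac{x_n - \hat\mu}{\hat\sigma}\right),$

which is an exact ancillary, we find

$\slashed{j}(\mu, \sigma) = \sigma^{-2}\begin{pmatrix} \sum g''(a_i) & \sum a_i g''(a_i) \\ \sum a_i g''(a_i) & n + \sum a_i^2 g''(a_i) \end{pmatrix}$


and, in an obvious notation,

$\slashed{l}_{\mu\mu;\mu} = -\sigma^{-3}\sum g'''(a_i),$
$\slashed{l}_{\mu\mu;\sigma} = -\sigma^{-3}\sum a_i g'''(a_i),$
$\slashed{l}_{\mu\sigma;\mu} = -\sigma^{-3}\{2\sum g''(a_i) + \sum a_i g'''(a_i)\},$
$\slashed{l}_{\mu\sigma;\sigma} = -\sigma^{-3}\{2\sum a_i g''(a_i) + \sum a_i^2 g'''(a_i)\},$
$\slashed{l}_{\sigma\sigma;\mu} = -\sigma^{-3}\{4\sum a_i g''(a_i) + \sum a_i^2 g'''(a_i)\},$
$\slashed{l}_{\sigma\sigma;\sigma} = -\sigma^{-3}\{2n + 4\sum a_i^2 g''(a_i) + \sum a_i^3 g'''(a_i)\},$
$\slashed{l}_{\mu\mu\mu} = \sigma^{-3}\sum g'''(a_i),$
$\slashed{l}_{\mu\mu\sigma} = \sigma^{-3}\{2\sum g''(a_i) + \sum a_i g'''(a_i)\},$
$\slashed{l}_{\mu\sigma\sigma} = \sigma^{-3}\{4\sum a_i g''(a_i) + \sum a_i^2 g'''(a_i)\},$
$\slashed{l}_{\sigma\sigma\sigma} = \sigma^{-3}\{4n + 6\sum a_i^2 g''(a_i) + \sum a_i^3 g'''(a_i)\}.$


Furthermore,

$\slashed{T}_{\mu\mu\mu} = 2\sigma^{-3}\slashed{l}_{\mu\mu\mu}((0,1); a),$
$\slashed{T}_{\mu\mu\sigma} = -2\sigma^{-3}\slashed{j}_{\mu\mu}((0,1); a) + 2\sigma^{-3}\slashed{l}_{\mu\mu\sigma}((0,1); a),$
$\slashed{T}_{\mu\sigma\sigma} = -4\sigma^{-3}\slashed{j}_{\mu\sigma}((0,1); a) + 2\sigma^{-3}\slashed{l}_{\mu\sigma\sigma}((0,1); a),$
$\slashed{T}_{\sigma\sigma\sigma} = -6\sigma^{-3}\slashed{j}_{\sigma\sigma}((0,1); a) + 2\sigma^{-3}\slashed{l}_{\sigma\sigma\sigma}((0,1); a).$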
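
The skewness formulas of Example 2.2 can be checked against the definition $-(l_{rst} + l_{rs;t}[3])$ by finite differences. The sketch below is only an illustration under stated assumptions: the error density is taken to be the hyperbolic secant density $f(x) = 1/(\pi\cosh x)$, so that $g(x) = \log\cosh x$ up to a constant; the data, the evaluation point $(\mu, \sigma) = (0.3, 1.7)$, the bisection solver and all names are hypothetical choices, not part of the paper.

```python
import math

def g2(x):                                   # g''(x) for g(x) = log cosh(x)
    return 1.0 / math.cosh(x) ** 2

def g3(x):                                   # g'''(x)
    return -2.0 * math.tanh(x) / math.cosh(x) ** 2

def mle(x):
    """(mu_hat, sigma_hat) by nested bisection on the likelihood equations
    sum g'(a_i) = 0 and sum a_i g'(a_i) = n, with g' = tanh."""
    n = len(x)
    def location(sigma):
        lo, hi = min(x), max(x)
        for _ in range(80):
            m = 0.5 * (lo + hi)
            if sum(math.tanh((xi - m) / sigma) for xi in x) > 0.0:
                lo = m
            else:
                hi = m
        return 0.5 * (lo + hi)
    lo, hi = 1e-3, 1e3
    for _ in range(80):
        s = 0.5 * (lo + hi)
        m = location(s)
        if sum((xi - m) / s * math.tanh((xi - m) / s) for xi in x) > n:
            lo = s                           # scaled residuals too large: raise sigma
        else:
            hi = s
    s = 0.5 * (lo + hi)
    return location(s), s

def partial(f, i, h=1e-3):
    """Central-difference partial derivative of f in its i-th argument."""
    def df(*x):
        up, dn = list(x), list(x)
        up[i] += h
        dn[i] -= h
        return (f(*up) - f(*dn)) / (2.0 * h)
    return df

x = [0.1, 0.9, 1.4, 2.6, 4.0]                # hypothetical data
mu_hat, sigma_hat = mle(x)
a = [(xi - mu_hat) / sigma_hat for xi in x]  # configuration statistic
n = len(a)

def loglik(mu, sigma, mu_h, sigma_h):
    """l(mu, sigma; mu_hat, sigma_hat, a) with the configuration a held fixed."""
    return -n * math.log(sigma) - sum(
        math.log(math.cosh((mu_h + sigma_h * ai - mu) / sigma)) for ai in a)

def skewness_fd(r, s, t, mu, sigma):
    """-(l_{rst} + l_{rs;t}[3]) by finite differences (0 = mu, 1 = sigma)."""
    p = (mu, sigma, mu, sigma)               # estimate set equal to parameter
    d3 = lambda i, j, k: partial(partial(partial(loglik, i), j), k)(*p)
    return -(d3(r, s, t) + d3(r, s, t + 2) + d3(r, t, s + 2) + d3(s, t, r + 2))

def skewness_closed(sigma):
    """The four closed-form expressions for the observed skewness tensor."""
    S = lambda p, d: sum(ai ** p * d(ai) for ai in a)
    j_mm, j_ms, j_ss = S(0, g2), S(1, g2), n + S(2, g2)
    l_mmm = S(0, g3)
    l_mms = 2 * S(0, g2) + S(1, g3)
    l_mss = 4 * S(1, g2) + S(2, g3)
    l_sss = 4 * n + 6 * S(2, g2) + S(3, g3)
    c = sigma ** -3
    return {(0, 0, 0): 2 * c * l_mmm,
            (0, 0, 1): c * (-2 * j_mm + 2 * l_mms),
            (0, 1, 1): c * (-4 * j_ms + 2 * l_mss),
            (1, 1, 1): c * (-6 * j_ss + 2 * l_sss)}

closed = skewness_closed(1.7)
fd = {k: skewness_fd(*k, 0.3, 1.7) for k in closed}
for k in closed:
    print(k, closed[k], fd[k])
```

The closed forms rely on the two likelihood equations satisfied by the configuration, which is why the sketch first computes the maximum likelihood estimate rather than using an arbitrary vector $a$.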


EXAMPLE 2.3. Hyperboloid model. Let $(u_1, v_1), \ldots, (u_n, v_n)$ be a sample from
the hyperboloid distribution


(2.17)    $p(u, v; \chi, \varphi) = (2\pi)^{-1}\lambda e^{\lambda}\sinh u \, \exp[-\lambda\{\cosh\chi\cosh u - \sinh\chi\sinh u\cos(v - \varphi)\}].$


Here $0 < u < \infty$, $0 < v < 2\pi$ and the parameters $\chi$ and $\varphi$ vary in $[0, \infty)$ and
$[0, 2\pi)$, respectively, while $\lambda > 0$ is a precision parameter which we consider as
known.

This distribution is analogous to the von Mises-Fisher or Langevin distribution
for three-dimensional unit vectors, but pertains to observations on the
positive unit hyperboloid in $R^3$ rather than the unit sphere. The distribution was
introduced in Barndorff-Nielsen (1978) and its most important properties, including
those on which we build below, have been unravelled by Jensen (1981); see
also Blæsild and Jensen (1981).

The hyperboloid model (2.17) is a transformation model, the acting group
being the special pseudo-orthogonal group $SO^{\uparrow}(1, 2)$, and


$a = \{(\sum\cosh u_i)^2 - (\sum\sinh u_i\cos v_i)^2 - (\sum\sinh u_i\sin v_i)^2\}^{1/2}$


is maximal invariant after minimal sufficient reduction. Furthermore, the
maximum likelihood estimate $(\hat\chi, \hat\varphi)$ of $(\chi, \varphi)$ exists uniquely, with probability 1,
$(a, \hat\chi, \hat\varphi)$ is minimal sufficient, and the conditional distribution of $(\hat\chi, \hat\varphi)$ given the
ancillary $a$ is again hyperboloidic, as in (2.17) but with $u$, $v$, and $\lambda$ replaced by $\hat\chi$,
$\hat\varphi$, and $a\lambda$ (the von Mises-Fisher distribution having a similar property). It may
also be noted that $s = a - n$ follows the gamma distribution


$\dfrac{\lambda^{n-1}}{\Gamma(n - 1)}s^{n-2}e^{-\lambda s}.$

It follows that the log likelihood function is

$l(\chi, \varphi) = l(\chi, \varphi; \hat\chi, \hat\varphi, a) = -a\lambda\{\cosh\chi\cosh\hat\chi - \sinh\chi\sinh\hat\chi\cos(\hat\varphi - \varphi)\}$


and hence

$\overset{\alpha}{\slashed{\Gamma}}_{\chi\chi\chi} = \overset{\alpha}{\slashed{\Gamma}}_{\chi\chi\varphi} = \overset{\alpha}{\slashed{\Gamma}}_{\chi\varphi\chi} = \overset{\alpha}{\slashed{\Gamma}}_{\varphi\varphi\varphi} = 0,$

$\overset{\alpha}{\slashed{\Gamma}}_{\chi\varphi\varphi} = a\lambda\cosh\chi\sinh\chi,$

$\overset{\alpha}{\slashed{\Gamma}}_{\varphi\varphi\chi} = -a\lambda\cosh\chi\sinh\chi,$

whatever the value of α. Thus, in this case, the α-geometries are identical.
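
The statement that the α-geometries coincide amounts to the observed skewness tensor (2.6) vanishing identically, so that by (2.12) every observed α-connection reduces to $\slashed{l}_{rs;t}$. The following sketch checks this by finite differences; the value of $a\lambda$, the evaluation point and the helper names are illustrative assumptions.

```python
import math

def partial(f, i, h=1e-3):
    """Central-difference partial derivative of f in its i-th argument."""
    def df(*x):
        up, dn = list(x), list(x)
        up[i] += h
        dn[i] -= h
        return (f(*up) - f(*dn)) / (2.0 * h)
    return df

A_LAMBDA = 6.0          # illustrative value of a * lambda (an assumption)

def loglik(chi, phi, chi_hat, phi_hat):
    """l(chi, phi; chi_hat, phi_hat, a) for the hyperboloid model."""
    return -A_LAMBDA * (math.cosh(chi) * math.cosh(chi_hat)
                        - math.sinh(chi) * math.sinh(chi_hat)
                        * math.cos(phi_hat - phi))

def mixed(r, s, t, point):
    """l_{rs;t}: parameter derivatives r, s and estimate derivative t
    (0 = chi, 1 = phi; estimate coordinates are arguments 2 and 3)."""
    return partial(partial(partial(loglik, r), s), t + 2)(*point)

def pure(r, s, t, point):
    """l_{rst}: three parameter derivatives."""
    return partial(partial(partial(loglik, r), s), t)(*point)

def skewness(r, s, t, point):
    """Observed skewness tensor -(l_{rst} + l_{rs;t}[3])."""
    return -(pure(r, s, t, point) + mixed(r, s, t, point)
             + mixed(r, t, s, point) + mixed(s, t, r, point))

chi, phi = 0.8, 1.1
point = (chi, phi, chi, phi)          # estimate set equal to parameter
triples = [(r, s, t) for r in (0, 1) for s in (0, 1) for t in (0, 1)]
gamma = {k: mixed(*k, point) for k in triples}     # equals the alpha-connection if T = 0
skew = {k: skewness(*k, point) for k in triples}
target = A_LAMBDA * math.cosh(chi) * math.sinh(chi)

print(max(abs(v) for v in skew.values()))          # close to zero
print(gamma[(0, 1, 1)], target, gamma[(1, 1, 0)])
```

With the index convention $\overset{\alpha}{\slashed{\Gamma}}_{rst} = \slashed{l}_{rs;t} + \tfrac{1-\alpha}{2}\slashed{T}_{rst}$, the two nonzero symbols are recovered as the entries for $(\chi, \varphi, \varphi)$ and $(\varphi, \varphi, \chi)$.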


We note again that whereas the auxiliary statistic a is taken so as to be 
ancillary in the various examples discussed—exact distribution constant in the 
three examples above and asymptotical distribution constant in the one to follow 
—ancillarity is no prerequisite for the general theory developed in this paper. 

Furthermore, let $a$ be any statistic which depends on the minimal sufficient
statistic $t$, say, only and suppose that the mapping from $t$ to $(\hat\omega, a)$ is defined and
one-to-one on some subset $\mathcal{T}_0$ of the full range $\mathcal{T}$ of values of $t$ though not,
perhaps, on all of $\mathcal{T}$. We can then endow the model $\mathcal{M}$ with observed geometries,
in the manner described above, for values of $t$ in $\mathcal{T}_0$. The next example illustrates
this point.

The above considerations allow us to deal with questions of nonuniqueness and 
nonexistence of maximum likelihood estimates and nonexistence of exact 
ancillaries, especially in asymptotic considerations. 


EXAMPLE 2.4. Inverse Gaussian-Gaussian model. Let $x(\cdot)$ and $y(\cdot)$ be
independent Brownian motions with a common diffusion coefficient $\sigma^2 = 1$ and
drift coefficients $\mu > 0$ and $\xi$, respectively. We observe the process $x(\cdot)$ until it
first hits a level $x_0 > 0$ and at the time $u$ when this happens we record the value
$v = y(u)$ of the second process. The joint distribution of $u$ and $v$ is then given by

$$p(u, v; \mu, \xi) = (2\pi)^{-1} x_0 e^{x_0\mu} u^{-2} \exp[-\tfrac{1}{2}(x_0^2 + v^2)u^{-1}] \exp[-\tfrac{1}{2}\mu^2 u + \xi v - \tfrac{1}{2}\xi^2 u]. \tag{2.18}$$



Suppose that $(u_1, v_1), \ldots, (u_n, v_n)$ is a sample from the distribution (2.18) and
let $t = (\bar u, \bar v)$, where $\bar u$ and $\bar v$ are the arithmetic means of the observations. Then
$t$ is minimal sufficient and follows a distribution similar to (2.18), specifically

$$p(\bar u, \bar v; \mu, \xi) = (2\pi)^{-1} x_0 n e^{n x_0\mu} \bar u^{-2} \exp[-\tfrac{n}{2}(x_0^2 + \bar v^2)\bar u^{-1}] \exp[-\tfrac{n}{2}\mu^2 \bar u + n\xi\bar v - \tfrac{n}{2}\xi^2 \bar u]. \tag{2.19}$$


Now, assume $\xi$ equal to $\mu$. The model (2.19) is then a (2,1) exponential model,
still with $t$ as minimal sufficient statistic. The maximum likelihood estimate of $\mu$
is undefined if $t \notin \mathcal{T}_0$, where

$$\mathcal{T}_0 = \{t = (\bar u, \bar v): x_0 + \bar v > 0\},$$

whereas for $t \in \mathcal{T}_0$, $\hat\mu$ exists uniquely and is given by

$$\hat\mu = \tfrac{1}{2}(x_0 + \bar v)\bar u^{-1}. \tag{2.20}$$

The event $t \notin \mathcal{T}_0$ happens with a probability that decreases exponentially fast
with the sample size $n$ and may therefore be ignored for most statistical
purposes.
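As a small numerical illustration of (2.20), the following sketch evaluates the $\mu$-dependent part of the logarithm of (2.19) with $\xi = \mu$, namely $n\{(x_0 + \bar v)\mu - \bar u\mu^2\}$, and compares the closed form with a grid search. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
def log_likelihood(mu, n, x0, u_bar, v_bar):
    """mu-dependent part of the log of (2.19) when xi = mu."""
    return n * ((x0 + v_bar) * mu - u_bar * mu ** 2)


def mle(x0, u_bar, v_bar):
    """Closed form (2.20); None when x0 + v_bar <= 0, where no estimate with mu > 0 exists."""
    if x0 + v_bar <= 0:
        return None
    return 0.5 * (x0 + v_bar) / u_bar


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    n, x0, u_bar, v_bar = 10, 1.0, 2.0, 0.5
    mu_hat = mle(x0, u_bar, v_bar)
    grid = [k / 1000 for k in range(1, 2001)]
    best_on_grid = max(grid, key=lambda m: log_likelihood(m, n, x0, u_bar, v_bar))
    print("closed form:", mu_hat, "grid maximizer:", best_on_grid)
```

Because the log likelihood is a concave quadratic in $\mu$, the grid maximizer should sit at the grid point nearest to the closed-form value.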
Defining, formally, $\hat\mu$ to be given by (2.20) even for $t \notin \mathcal{T}_0$ and letting
$$a = \Phi^-(\bar u; 2nx_0^2, 2n\hat\mu^2),$$


where $\Phi^-(\cdot; \chi, \psi)$ denotes the distribution function of the inverse Gaussian
distribution with density function
$$\varphi^-(x; \chi, \psi) = (2\pi)^{-1/2}\chi^{1/2} e^{\sqrt{\chi\psi}}\, x^{-3/2}\exp[-\tfrac{1}{2}(\chi x^{-1} + \psi x)], \tag{2.21}$$
we have that the mapping $t \to (\hat\mu, a)$ is one-to-one from $\mathcal{T} = \{t = (\bar u, \bar v): \bar u > 0\}$
onto $(-\infty, +\infty) \times (0, \infty)$ and that $a$ is asymptotically ancillary and has the
property that $p^*(\hat\mu; \mu \mid a) = c|\hat\jmath|^{1/2}\bar L$ approximates the actual conditional density
of $\hat\mu$ given $a$ to order $O(n^{-3/2})$, cf. Barndorff-Nielsen (1984a).

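Letting $\Phi_-(\cdot; \chi, \psi)$ denote the inverse function of $\Phi^-(\cdot; \chi, \psi)$ we may write
the log likelihood function for $\mu$ as

$$l(\mu) = l(\mu; \hat\mu, a) = n\{(x_0 + \bar v)\mu - \bar u\mu^2\} = n\Phi_-(a; 2nx_0^2, 2n\hat\mu^2)\{2\hat\mu\mu - \mu^2\}. \tag{2.22}$$

From this we find

$$l_{\mu\mu} = -2n\Phi_-(a; 2nx_0^2, 2n\hat\mu^2),$$

so that

$$j = 2n\Phi_-(a; 2nx_0^2, 2n\mu^2),$$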

=0 


Fun 




and 
= 8n?p( p7 2 ®_/2;)(a; 2nx?, 2np”) 


1 = 1 


z Tay sery pup? 


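where $\Phi^-_\psi$ denotes the derivative of $\Phi^-(x; \chi, \psi)$ with respect to $\psi$. By the
well-known result [Shuster (1968)]
$$\Phi^-(x; \chi, \psi) = \Phi(\psi^{1/2}x^{1/2} - \chi^{1/2}x^{-1/2}) + e^{2\sqrt{\chi\psi}}\,\Phi(-(\psi^{1/2}x^{1/2} + \chi^{1/2}x^{-1/2})),$$
where $\Phi$ is the distribution function of the standard normal distribution, $\Phi^-_\psi$
could be expressed in terms of $\Phi$ and $\varphi = \Phi'$.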
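The closed form for the inverse Gaussian distribution function can be checked numerically against the density (2.21). The sketch below is only an illustration under stated assumptions: the parametrization $(\chi, \psi)$ is read off from (2.21), the function names are hypothetical, and the integral is approximated with a composite Simpson rule.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def ig_density(x, chi, psi):
    """Inverse Gaussian density (2.21) with parameters chi > 0 and psi > 0."""
    return (math.sqrt(chi / (2.0 * math.pi)) * math.exp(math.sqrt(chi * psi))
            * x ** -1.5 * math.exp(-0.5 * (chi / x + psi * x)))


def ig_cdf_shuster(x, chi, psi):
    """Distribution function in the closed form attributed to Shuster (1968)."""
    up = math.sqrt(psi * x)
    down = math.sqrt(chi / x)
    return (std_normal_cdf(up - down)
            + math.exp(2.0 * math.sqrt(chi * psi)) * std_normal_cdf(-(up + down)))


def ig_cdf_numeric(x, chi, psi, steps=20000):
    """Composite Simpson integral of the density over (0, x]; steps must be even.

    The density tends to 0 as its argument tends to 0, so the left endpoint contributes 0.
    """
    h = x / steps
    total = ig_density(x, chi, psi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * ig_density(k * h, chi, psi)
    return total * h / 3.0


if __name__ == "__main__":
    chi, psi = 2.0, 3.0   # illustrative parameter values
    for x in (0.5, 1.0, 2.0):
        print(x, ig_cdf_shuster(x, chi, psi), ig_cdf_numeric(x, chi, psi))
```

The two columns printed for each $x$ should agree closely, which is a quick way to catch a sign or exponent slip in the closed form.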


For any $m = 2, 3, \ldots$ a covariant tensor on $\mathcal{M}$ of rank $m$ is given by
$$E\{\partial_{r_1} l \cdots \partial_{r_m} l\}, \tag{2.23}$$
the first two of which, i.e., $i$ and $T$, determine the expected geometries studied by
Amari and others, as discussed briefly in the foregoing. The tensors $j$ and $\not T$ are
observed analogues of $i$ and $T$ and it seems natural to enquire whether, similarly,
there exist observed analogues of (2.23) for $m > 3$.


For $m = 4$ an approach like that used above to derive $\not T$ does not, in any
obvious way at least, lead to a fourth-rank tensor. However, one may proceed 


otherwise by noting that 7 equals the covariant derivative of j relative to the f 
connection. In fact, denoting the operation of covariant differentiation with 


respect to w” and relative to Ý by P, we have 


(2.24) D ihe = Pehea ~ ids = Fi ru 


=a rat 


1 a 
and hence T. = PJ,- Similarly, with D indicating covariant differentiation as 
determined by the expected connection I, we have 
(2.25) Di, = of 


rst* 
Š 


Formulas (2.24) and (2.25) are special cases of a more general differential-geomet- 
ric result due to Lauritzen (1984). Taking now the covariant derivative of T.,, we 
obtain 

1 1 1 1 

D Tent = OT et = aay om T Tsu rto P tarso 


= Spata E Foii „[4] T fas tu Ls su Ws ru 
a fru, w PE a = fou, at? = fiu: Fs ai 


In contrast to $\not T_{rst}$, this expression is not symmetric in the four indices. To obtain
symmetry we introduce


(2.26) Troin = 1D TF L4] 
=~ hate + brat lA] + tu Ena 


which is a covariant tensor of rank 4. This may be compared to each of the 
“expected tensors” 


(2 27) Tratu = D, Tretl4 | 
ee + LeluL4] + (lns liu + L, slalu + Lt Tium” “16])} 


I! 


and 

$$M_{rstu} = E\{\partial_r l\,\partial_s l\,\partial_t l\,\partial_u l\}. \tag{2.28}$$

The latter may, as appears by differentiation of (2.13) twice, be rewritten as
$$M_{rstu} = -E\{l_{rstu} + l_{rst}l_u[4] + l_{rs}l_{tu}[3] + l_{rs}l_t l_u[6]\}. \tag{2.29}$$


The tensors (2.26) and (2.27) are closely analogous. In particular, they are
identical for (k, k) exponential models, with common value $-E\{l_{rstu}\}$. In contrast
to this, $M_{rstu}$ does not equal the common value of $\not T_{rstu}$ and $T_{rstu}$ for such
models. But if instead of the fourth moment of the score vector, i.e., $M_{rstu}$, we
consider the fourth cumulant, i.e.,

$$K_{rstu} = M_{rstu} - i_{rs}i_{tu}[3],$$
then we find that this is also a tensor, that
K „stu T -E{l retu T Lethu [4] + a bral a + l, slilu + (lhl, = i,gi1,))[6]}, 


and that for (k, k) exponential models $\not T_{rstu} = T_{rstu} = K_{rstu}$.
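As a hedged one-parameter check of (2.29) and of the fourth cumulant, consider the exponential distribution with density $\theta e^{-\theta x}$, which is a (1, 1) exponential model. This example is not taken from the paper and serves only as an illustration. With a single index the bracketed multiplicities [4], [3], and [6] become numerical factors.

```latex
\begin{align*}
l &= \log\theta - \theta x, \qquad
l_1 = \theta^{-1} - x, \quad
l_2 = -\theta^{-2}, \quad
l_3 = 2\theta^{-3}, \quad
l_4 = -6\theta^{-4}, \quad
i = \theta^{-2}, \\
M &= -E\{l_4 + 4\,l_3 l_1 + 3\,l_2^2 + 6\,l_2 l_1^2\}
   = -\{-6 + 0 + 3 - 6\}\theta^{-4}
   = 9\theta^{-4}
   = E\{(x - \theta^{-1})^4\}, \\
K &= M - 3i^2 = 6\theta^{-4} = -E\{l_4\}.
\end{align*}
```

In this example the fourth moment $M$ differs from $-E\{l_4\}$ while the fourth cumulant $K$ agrees with it, in line with the statement for (k, k) exponential models.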

More generally, for $m = 2, 3, \ldots$ let $K_{r_1 \cdots r_m}$ denote the mth-order cumulant of
the score vector $\partial l = (\partial_1 l, \ldots, \partial_d l)$. From the tensorial nature of the moments
(2.23) of $\partial l$ and from the general formulae relating moments and cumulants, cf.
for instance Speed (1983) or McCullagh (1984b), we find that $K_{r_1 \cdots r_m}$ is a
covariant tensor of rank $m$. Furthermore, writing $\not T_{rs}$ for $j$, $T_{rs}$ for $i$ and
defining $\not T_{r_1 \cdots r_m}$ and $T_{r_1 \cdots r_m}$ recursively by


a Lm +1] 





“m+ 





Fiese m+t = mt+l Dw, Fy r,Lm T 1], 


we have J. = T,,..,.=X,,...,, for (k, k) exponential models and for any 


3. Expansion of $c|\hat\jmath|^{1/2}\bar L$. We shall derive an asymptotic expansion of (1.5),
by Taylor expansion of $c|\hat\jmath|^{1/2}\bar L$ in $\hat\omega$ around $\omega$, for fixed value of the auxiliary $a$.
The various terms of this expansion are given by mixed derivatives [cf. (1.4)] of
the log model function. It should be noted that for arbitrary choice of the
auxiliary statistic $a$ the quantity $c|\hat\jmath|^{1/2}\bar L$ constitutes a probability (density)
function on the domain of variation of $\hat\omega$ and the expansions below are valid.
However, $c|\hat\jmath|^{1/2}\bar L$ furnishes an approximation to the actual conditional distribution
of $\hat\omega$ given $a$, as discussed in Section 1, only for suitable ancillary specification
of $a$.

To expand $c|\hat\jmath|^{1/2}\bar L$ in $\hat\omega$ around $\omega$ we first write $\bar L$ as $\exp\{l - \hat l\}$ and expand $l$
in $\omega$ around $\hat\omega$. By Taylor's formula,

A wa 1 
i-l=¥5 e ~ 8)" --- (w — 6)"(4,, «++ 38) 


per? °° 


whence, expanding each of the terms (ð, --- 0,1)(@) around w, 





%0 —1 á 
jele 2 n o (0-0) 
(3.1) pm 
SG e)" e (B= 0)" Bu Badan 


Consequently, writing $\delta$ for $\hat\omega - \omega$ and $\delta^{rs\cdots}$ for $(\hat\omega - \omega)^r(\hat\omega - \omega)^s \cdots$, we have


(3.2) | i= = the) + ETa C AP + 2 pst) 


+ oriin C A tu t Shas, u t 3frstu) Peers 
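Next, we wish to expand $\log\{|\hat\jmath|/|j|\}^{1/2}$ in $\hat\omega$ around $\omega$. To do this we observe that
if $A$ is a $d \times d$ matrix whose elements $a_{rs}$ depend on $\omega$ then
$$\partial_t \log|A| = |A|^{-1}\partial_t|A| = a^{rs}\,\partial_t a_{rs},$$
where $a^{rs}$ denotes the $(r, s)$-element of the inverse of $A$. Furthermore, using
$$\partial_t a^{rs} = -a^{rv}a^{ws}\,\partial_t a_{vw},$$
which is obtained by differentiating $a_{rs}a^{st} = \delta_r^t$ with respect to $\omega^u$ and solving
for $\partial_u a^{st}$, we find
$$\partial_t\partial_u \log|A| = -a^{rv}a^{ws}\,\partial_t a_{rs}\,\partial_u a_{vw} + a^{rs}\,\partial_t\partial_u a_{rs}.$$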
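The two derivative identities for $\log|A|$ can be checked numerically. The sketch below is a minimal illustration, assuming a hypothetical symmetric positive definite $2 \times 2$ matrix that depends on a single scalar parameter (so that both differentiations are with respect to the same coordinate); it compares the analytic expressions with central finite differences.

```python
import math

IDX = range(2)


def A(w):
    """Hypothetical symmetric positive definite 2 x 2 matrix depending on a scalar w."""
    return [[2.0 + w * w, math.sin(w)], [math.sin(w), 3.0 + math.exp(w)]]


def dA(w):
    """Elementwise first derivative of A with respect to w."""
    return [[2.0 * w, math.cos(w)], [math.cos(w), math.exp(w)]]


def d2A(w):
    """Elementwise second derivative of A with respect to w."""
    return [[2.0, -math.sin(w)], [-math.sin(w), math.exp(w)]]


def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inverse(m):
    d = det(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def log_det(w):
    return math.log(det(A(w)))


def first_derivative(w):
    """a^{rs} d a_{rs}, the first identity."""
    ai, da = inverse(A(w)), dA(w)
    return sum(ai[r][s] * da[r][s] for r in IDX for s in IDX)


def second_derivative(w):
    """-a^{rv} a^{xs} da_{rs} da_{vx} + a^{rs} d2a_{rs}, the second identity."""
    ai, da, dda = inverse(A(w)), dA(w), d2A(w)
    quad = sum(ai[r][v] * ai[x][s] * da[r][s] * da[v][x]
               for r in IDX for s in IDX for v in IDX for x in IDX)
    lin = sum(ai[r][s] * dda[r][s] for r in IDX for s in IDX)
    return -quad + lin


if __name__ == "__main__":
    w0, h = 0.3, 1e-4
    fd1 = (log_det(w0 + h) - log_det(w0 - h)) / (2 * h)
    fd2 = (log_det(w0 + h) - 2 * log_det(w0) + log_det(w0 - h)) / h ** 2
    print(first_derivative(w0), fd1)
    print(second_derivative(w0), fd2)
```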
It follows that 
log {IA}? = — 3891 se + hra.) 
(3.3) ~ 18 P(t + ee u ag Se Bg Jrs, ju) 


HEP” froi + rs, Dez: a vw; wy Or og 
By means of (3.2) and (3.3) we therefore find

$$c|\hat\jmath|^{1/2}\bar L = (2\pi)^{d/2} c\,\varphi_d(\hat\omega - \omega; j)\{1 + A_1 + A_2 + \cdots\}, \tag{3.4}$$

where $\varphi_d(\cdot; j)$ denotes the density function of the d-dimensional normal distribution
with mean 0 and precision (i.e., inverse variance-covariance matrix) $j$ and
where


(3.5) Ay = — 485%, et trae) + 28°" (Lee, + Hees) 
and 
A, = hl 882 rotu + fratu + frou, t + hre; tu) 
OPP — 17°) Frost + brat) Pow; u + Foun) 
(3.6) +6 (3h s1u + Bhrat, u + Ofra; tu) 
=ef ia had aa E See} 


+38 y t + A vat) A w + 3 || , 
$A_1$ and $A_2$ being of order $O(n^{-1/2})$ and $O(n^{-1})$, respectively, under ordinary
repeated sampling.
By integration of (3.4) with respect to $\hat\omega$ we obtain

$$(2\pi)^{d/2} c = 1 + C_1 + \cdots, \tag{3.7}$$
where $C_1$ is obtained from $A_2$ by changing the sign of $A_2$ and making the
substitutions

$$\delta^{rs} \to j^{rs}, \qquad \delta^{rstu} \to j^{rs}j^{tu}[3], \qquad \delta^{rstuvw} \to j^{rs}j^{tu}j^{vw}[15],$$
the 3 and 15 terms in the two latter expressions being obtained by appropriate
permutations of the indices (thus, for example, $\delta^{rstu} \to j^{rs}j^{tu} + j^{rt}j^{su} + j^{ru}j^{st}$).
Combination of (3.4) and (3.7) finally yields
$$c|\hat\jmath|^{1/2}\bar L = \varphi_d(\hat\omega - \omega; j)\{1 + A_1 + (A_2 + C_1) + \cdots\} \tag{3.8}$$

with an error term which in wide generality is of order $O(n^{-3/2})$ under repeated
sampling. In comparison with an Edgeworth expansion it may be noted that the
expansion (3.8) is in terms of mixed derivatives of the log model function, rather
than in terms of cumulants, and that the error of (3.8) is relative, rather than
absolute.
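The multiplicities [3] and [15] in the substitutions count the ways of splitting four or six indices into unordered pairs, as for the even moments of a normal distribution. The following sketch enumerates these pairings; the function names are hypothetical and the output format is chosen only for readability.

```python
def pairings(indices):
    """All ways of splitting a list of distinct indices into unordered pairs."""
    if not indices:
        return [[]]
    first, rest = indices[0], indices[1:]
    result = []
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            result.append([(first, partner)] + tail)
    return result


def as_terms(indices):
    """Render each pairing as a product of factors j^{..}."""
    return [" ".join("j^{%s%s}" % pair for pair in p) for p in pairings(indices)]


if __name__ == "__main__":
    for term in as_terms(list("rstu")):
        print(term)
    print(len(pairings(list("rstuvw"))), "pairings of six indices")
```

For four indices the enumeration reproduces the three products quoted in the text; for six indices it lists the fifteen products behind the symbol [15].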

In particular, under repeated sampling and if the auxiliary statistic is
(approximately or exactly) ancillary such that

$$p(\hat\omega; \omega \mid a) = p^*(\hat\omega; \omega \mid a)\{1 + O(n^{-3/2})\}$$
(cf. Section 1) we generally have
$$p(\hat\omega; \omega \mid a) = \varphi_d(\hat\omega - \omega; j)\{1 + A_1 + (A_2 + C_1) + O(n^{-3/2})\}. \tag{3.9}$$

For one-parameter models, i.e., for $d = 1$, the expansion (3.8) with $A_1$, $A_2$, and
$C_1$ as given above reduces to an expansion derived in Barndorff-Nielsen and Cox
(1984a). Using that expansion, confidence limits for $\omega$, valid to order $O(n^{-3/2})$,
have been derived in Barndorff-Nielsen (1985a). In the former of those two
papers a relation valid to order $O(n^{-3/2})$ was established, for general $d$, between
the norming constant $c$ of (1.4) and the Bartlett adjustment factors for likelihood
ratio tests of hypotheses about $\omega$. By means of this relation such adjustment
factors may be simply calculated from the expression for $C_1$.


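EXAMPLE 3.1. Suppose $\mathcal{M}$ is a (k, k) exponential model with model function
(2.14). Then the expression for $C_1$ takes the form
$$C_1 = \tfrac{1}{24}\{3\kappa_{rstu}\kappa^{rs}\kappa^{tu} - \kappa_{rst}\kappa_{uvw}(2\kappa^{ru}\kappa^{sv}\kappa^{tw} + 3\kappa^{rs}\kappa^{tu}\kappa^{vw})\},$$
where, for $\partial_r = \partial/\partial\theta^r$ and $\kappa(\theta) = -\log a(\theta)$,
$$\kappa_{rs\cdots} = \partial_r\partial_s\cdots\kappa(\theta)$$
and where $\kappa^{rs}$ is the inverse matrix of $\kappa_{rs}$.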


From (3.8) we find the following expansion for the mean value of ô: 
E0 = otp tutt, 
where pf is of order O(n~*), „3 is of order O(n~*), and 
-1 
(8.10) Mm HE ae PT a 
Hence, from (3.8) and writing ô’ for ô — w, 
OAL = gq(@ = o = m; PÒL + (Ay = Sheni) +--+} 
= y(O - w — p; rhe + he Ce Dites F ETAR a }, 
where the error term is of order O(n~') and where A™: (+; j) denotes the 


tensorial Hermite polynomial [as defined by Amari and Kumon (1983)], relative 
to the tensor j|,. Using (2.12) we may rewrite the last quantity in (3.11) as 


(3.11) 


-1/3 
(3.12) frs; t + ; rst T f rst E Rpt 
where 
(3.13) Rat F ah i Shas + st; ee 
Since 
(3.14) hr**( 8’; D = 8/7858! — j8” [83] 
we find 


h (8; DR r= 0 
and hence (3.11) reduces to 


= -1/3 
(3.15) cjJ ZL on Pal ô -o — p; Dh a LArt( 5’; D K mete \, 


the error term being O(n7'). 
Suppose, in particular, that the model is an exponential (k, d) model. We may 
then compare (3.15) with the Edgeworth expansion for an efficient, bias adjusted 




estimate of w given an ancillary statistic, provided by tornua (3.33) and (3.25) 
in Amari and Kumon (1983). It appears hat h(S; D Fi ret Of @. 15) is the 


counterpart of Amari and Kumon’s ia abeh — Hes „h'?h" + H gh hl: 


Thus (3.15) offers some simplification over the pe expression provided 
by the Amari and Kumon paper. 
Note that, again by the symmetry of (3.14), if 


-1/3 


(3.16) f ræl3] =0 


for all $r$, $s$, $t$ then the first-order correction term in (3.15) is 0. Furthermore, for
any one-parameter model $\mathcal{M}$ the quantity $\overset{\alpha}{\not\Gamma}$ with $\alpha = -\tfrac{1}{3}$ can be made to
vanish by choosing that parametrization for which $\omega$ is the geodesic coordinate
for the $-\tfrac{1}{3}$ observed connection. (Note that generally this parametrization will
depend on the value of the ancillary $a$.) An analogous result holds for the
Edgeworth expansion derived by Amari and Kumon (1983), referred to above.

The parametrization making the $\alpha = -\tfrac{1}{3}$ expected connection $\Gamma$ vanish has the
interpretation of a skewness reducing parametrization, cf. Kass (1984).


Acknowledgments. I am much indebted to Peter E. Jupp, Steffen L.
Lauritzen, and Hans Anton Salomonsen for valuable discussions. Helpful comments
from an associate editor and two referees are also gratefully acknowledged.


REFERENCES 


AMARI, S.-I. (1982a). Differential geometry of curved exponential families—curvatures and informa- 
tion loss. Ann. Statist. 10 357-385. 

AMARI, S.-I. (1982b). Geometrical theory of asymptotic ancillarity and conditional inference.
Biometrika 69 1-17.

AMARI, S.-I. (1983). Comparisons of asymptotically efficient tests in terms of geometry of statistical 
structures. Bull. Internat. Statist. Inst. 50 1190-1206.

AMARI, S.-I. (1984). Differential geometrical theory of statistics—towards new developments. To 
appear in Differential Geometry in Statistical Inference. IMS monograph.

AMARI, S.-I. (1985). Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics 28. 
Springer, Heidelberg. 

AMARI, S.-I. and Kumon, M. (1983). Differential geometry of Edgeworth expansion in curved 
exponential family. Ann. Inst. Statist. Math. 35 1-24. 

BARNDORFF-NIELSEN, O. E. (1978). Hyperbolic distributions and distributions on hyperbolae. Scand.
J. Statist. 5 151-157.

BARNDORFF-NIELSEN, O. E. (1980). Conditionality resolutions. Biometrika 67 293-310.

BARNDORFF-NIELSEN, O. E. (1983). On a formula for the distribution of the maximum likelihood
estimator. Biometrika 70 343-365.

BARNDORFF-NIELSEN, O. E. (1984a). On conditionality resolution and the likelihood ratio for curved
exponential families. Scand. J. Statist. 11 157-170.

BARNDORFF-NIELSEN, O. E. (1984b). Differential and integral geometry in statistical inference. To
appear in Differential Geometry in Statistical Inference. IMS monograph.

BARNDORFF-NIELSEN, O. E. (1985a). Confidence limits from $c|\hat\jmath|^{1/2}\bar L$ in the single-parameter case.
Scand. J. Statist. 12 83-87.


BARNDORFF-NIELSEN, O. E. (1985b). Inference on full or partial parameters based on the standardized
signed log likelihood ratio. To appear in Biometrika 73.

BARNDORFF-NIELSEN, O. E. and BLÆSILD, P. (1984). Combination of reproductive models. Research 
Report 107, Dept. Theor. Statist., Aarhus Univ. 

BARNDORFF-NIELSEN, O. E. and Cox, D. R. (1984a). Bartlett adjustments to the likelihood ratio 
statistic and the distribution of the maximum likelihood estimator. J. Roy. Statist. Soc. 
Ser. B 46 483-495. 

BARNDORFF-NIELSEN, O. E. and Cox, D. R. (1984b). The effect of sampling rules on likelihood
statistics. Internat. Statist. Rev. 52 309-326.

BARNDORFF-NIELSEN, O. E. and Jupp, P. E. (1984). Differential geometry, profile likelihood and 
L-sufficiency. Research Report 118, Dept. Theor. Statist., Aarhus Univ. 

BARNDORFF-NIELSEN, O. E. and Jupp, P. E. (1985). Profile likelihood, marginal likelihood and 
differential geometry of composite transformation models. Research Report 122, Dept. 
Theor. Statist., Aarhus Univ. 

BLÆSILD, P. and JENSEN, J. L. (1981). Multivariate distributions of hyperbolic type. In Statistical 
Distributions in Scientific Work. (C. Taillie, G. P. Patil and B. A. Baldessari, eds.) 4 45-66. 
D. Reidel, Dordrecht. 

BURRIDGE, J. (1981). A note on maximum likelihood estimation for regression models using grouped 
data. J. Roy. Statist. Soc. Ser. B 43 41-45. 

CHENTSOV, N. N. (1972). Statistical Decision Rules and Optimal Inference. Nauka, Moscow (in 
Russian). (English translation in Translation of Mathematical Monographs 53 (1982). 
Amer. Math. Soc., Providence, R.I.) 

EFRON, B. (1975). Defining the curvature of a statistical problem (with application to second order
efficiency) (with discussion). Ann. Statist. 3 1189-1242. 

JENSEN, J. L. (1981). On the hyperboloid distribution. Scand. J. Statist. 8 193-206. 

Kass, R. E. (1984). Canonical parameterizations and zero parameter-effects curvature. J. Roy. 
Statist. Soc. Ser. B 46 86-92. 

LAURITZEN, S. L. (1984). Statistical manifolds. To appear in Differential Geometry in Statistical 
Inference. IMS monograph. 

MCCULLAGH, P. (1984a). Local sufficiency. Biometrika 71 233-244.

MCCULLAGH, P. (1984b). Tensor notation and cumulants of polynomials. Biometrika 71 461-476. 

SHUSTER, J. J. (1968). A note on the inverse Gaussian distribution function. J. Amer. Statist. Assoc. 
63 1514-1516. 

SPEED, T. P. (1983). Cumulants and partition lattices. Austral. J. Statist. 25 378-388.


DEPARTMENT OF THEORETICAL STATISTICS 
INSTITUTE OF MATHEMATICS 

AARHUS UNIVERSITY 

Ny MUNKEGADE 

DK-8000 AARHUS C 

DENMARK 


The Annals of Statistics
1986, Vol. 14, No. 3, 874-895 


RECTANGULAR LATTICE DESIGNS: 
EFFICIENCY FACTORS AND ANALYSIS 


By R. A. BAILEY AND T. P. SPEED 


Rothamsted Experimental Station and C.S.I.R.O. 


Rectangular lattice designs are shown to be generally balanced with 
respect to a particular decomposition of the treatment space. Efficiency
factors are calculated, and the analysis, including recovery of interblock 
information, is outlined. The ideas are extended to rectangular lattice designs 
with an extra blocking factor. 


1. Introduction. The class of incomplete block designs known as rectangu- 
lar lattice designs was introduced by Harshbarger (1946), with further details and 
extensions being given in a subsequent series of papers by Harshbarger (1947, 
1949, 1951) and Harshbarger and Davis (1952). Apart from a contribution by 
Grundy (1950) concerning the efficient estimation of the stratum variances and 
the papers by Nair (1951, 1952, 1953) relating rectangular lattice designs to 
partially balanced designs, little further theoretical discussion of this class of 
designs seems to have occurred. Expositions of the basic results about rectangular 
lattice designs in two and three replicates, as well as tables of designs, can be 
found in Robinson and Watson (1949) and Cochran and Cox (1957). Discussions 
exist in other standard texts on the design and analysis of experiments, for 
example Kempthorne (1952), but, apart from recent contributions by Williams 
(1977) and Williams and Ratcliff (1980), the literature seems to end in the early 
1950’s. [In his recent note, Thompson (1983) uses the results in the present paper, 
as he acknowledges.] A possible explanation of this fact may be the observations 
of Nair (1951, 1953) that every 2-replicate rectangular lattice design is a partially 
balanced incomplete block design with four associate classes, whilst the obvious 
extension of the argument to r-replicate rectangular lattice designs for r ≥ 3 fails
in general, although the classes of rectangular lattice designs for n(n — 1) 
treatments in n — 1 or n replicates again turn out to be partially balanced. 
Perhaps it was felt that, in not being partially balanced, rectangular lattice 
designs were rather too complicated. 

In his fundamental papers on designed experiments with simple orthogonal 
block structure Nelder (1965a, b) introduced the notion of general balance, this 
being a relationship between the treatment structure and the block structure of 
the design. It is immediate from his definition that all block experiments (in the 
usual sense of the term) are generally balanced for some treatment structure [see 
Houtman and Speed (1983)], although here we might more properly use the term 
treatment pseudo-structure, and when this structure is elucidated for a given 
class of designs they can be regarded as understood and readily analysed. In a 


Received June 1985; revised September 1985.

AMS 1980 subject classifications. Primary 62K10; secondary 62J10, 05B15.

Key words and phrases. Analysis of variance, block structure, combination of information, 
efficiency factor, general balance, Latin square, rectangular lattice, resolvable design, stratum, 
treatment decomposition. 




later paper, Nelder (1968) showed the importance of general balance in permit- 
ting the straightforward estimation of stratum variances, introducing a method 
equivalent to that which has come to be known as restricted maximum likelihood 
estimation of variances [see Patterson and Thompson (1971) and Harville (1977)]. 
The definition of general balance in block designs is intimately connected with 
the eigenspaces of a certain linear transformation, denoted by Lp in this paper, 
and in this form a number of other authors have recently emphasised the same 
concept [see, for example, Pearce, Caliński, and Marshall (1974), who called the
eigenvectors of Lp basic contrasts, and Corsten (1976)]. 

In Sections 3 and 4 of this paper we obtain an orthogonal decomposition of the 
space of all treatment contrasts associated with a general r-replicate rectangular 
lattice design. In Section 5 we use this decomposition to identify all the eigen- 
spaces of the linear transformation Lp. An equivalent description of our results is 
that we determine the treatment pseudo-structure relative to which the designs 
are generally balanced; equivalently again, we describe the basic contrasts of the 
design. Using these results, a full analysis, modelled on Nelder’s (1965b, 1968) 
general approach, of rectangular lattice designs is given in Section 6, involving 
the derivation of a fully orthogonal analysis of variance and estimates of the 
stratum variances, and the calculations of estimates of treatment contrasts, 
together with their standard errors. A recursive analysis along the lines of 
Wilkinson (1970) is most satisfactory, as the eigenspaces are orthogonal comple- 
ments of subspaces each of which has a simple formula for its orthogonal 
projection in terms of averaging operators, and so these subspaces can be swept 
out successively in a quite straightforward manner. Our general approach to the 
analysis of designed experiments is framed in vector space terms, similar to that 
used by James and Wilkinson (1971) and Bailey (1981), but in the multistratum 
framework of Nelder’s papers. 

Finally, we use the foregoing ideas to sketch the design and analysis of an 
experiment in which an extra blocking factor was imposed on a rectangular 
lattice design. Two examples are used throughout the paper to illustrate the 
theory. 


EXAMPLE 1. This is a rectangular lattice for 20 treatments in three replicates 
of five blocks of four plots. Although this is an entirely abstract example, there 
being no associated experiment, it illustrates the general theory well because it 
has no special features: the design is not partially balanced, and its construction 
does not use a complete set of mutually orthogonal Latin squares. Tables 1, 3-5, 
7, and 12-16 refer to Example 1. 


EXAMPLE 2. In an experiment into the digestibility of stubble, 12 feed 
treatments were applied to sheep. There were 12 sheep, in three rooms of four 
animals each. There were three test periods of four weeks each, separated by 
two-week recovery periods. Each sheep was fed three treatments, one in each test 
period. During the recovery periods all animals received their usual feed, so that
they would return to normal conditions before being subjected to a new treat- 
ment. 




TABLE 1
Transversal of a 5 × 5 Latin square

[array with circled transversal cells; entries not legible in this copy]

It was desired that each treatment should be fed once in each room and once 
in each period. If periods are ignored, a suitable design is a rectangular lattice 
design in which sheep are blocks and rooms are replicates. We shall ignore the 
periods until Section 7, where we show how to deal with this extra blocking 
factor. Tables 9-11 and 18-19 refer to Example 2. 


2. Construction. In this section we review the construction of rectangular 
lattice designs, partly in order to establish our terminology and notation. 

A rectangular lattice design is a resolvable incomplete block design for $t$
treatments in $r$ replicates of $n$ blocks of size $n - 1$, where $t = n(n - 1)$ and
$2 \le r \le n$, for some integer $n$. We write $b$ for $rn$, the total number of blocks, and
$N$ for $b(n - 1)$, the total number of plots. The design has the property that any
pair of treatments occur together in at most one block. The design is constructed
from a set of $r - 2$ mutually orthogonal $n \times n$ Latin squares $A_1, \ldots, A_{r-2}$.

A transversal of such a set of Latin squares is defined [see Dénes and
Keedwell (1974), pages 28 and 331] to be a set of $n$ cells with one cell in each row
and one in each column, which between them have all the letters of all the
squares $A_1, \ldots, A_{r-2}$. In Table 1 a transversal of a single 5 × 5 Latin square is
indicated with circles. Transversals do not always exist: Table 2 shows a 4 × 4
Latin square with no transversal. A sufficient condition for the existence of a
transversal is the existence of a Latin square $A_{r-1}$ orthogonal to each of
$A_1, \ldots, A_{r-2}$, for then each letter of $A_{r-1}$ corresponds to a transversal. Such a set
of mutually orthogonal $n \times n$ Latin squares $A_1, \ldots, A_{r-1}$ exists whenever $n$ is a
prime or prime power and $r$ is less than or equal to $n$ [see Dénes and Keedwell
(1974), page 165]. However, this condition is not necessary, because the square in
Table 1 has no orthogonal mate.

It is convenient (although not essential) to permute the rows and columns of
$A_1, \ldots, A_{r-2}$ simultaneously so that the transversal lies down the main diagonal.


TABLE 2
A 4 × 4 Latin square with no transversal

[array entries not legible in this copy]



TABLE 3a
Table 1 with rows permuted

1 2 3 4 5
3 5 1 2 4
2 1 4 5 3
5 4 2 3 1
4 3 5 1 2

TABLE 3b
Table 3a with letters permuted

1 5 4 3 2
4 2 1 5 3
5 1 3 2 4
2 3 5 4 1
3 4 2 1 5

This is achieved by moving the ith row to the jth row if the unique transversal
cell in row i is in column j. It is also convenient to rename the "letters" of each
square independently so that the letters on the main diagonal are in natural
order. Tables 3a and 3b show the results of applying these processes to the square
in Table 1.
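The two normalising steps can be sketched in a few lines. The code below is only an illustration: the function names are hypothetical, Table 3a is taken as the starting point, and the scrambled row order used in the demonstration is invented (it is not claimed to be the row order of Table 1).

```python
TABLE_3A = [
    [1, 2, 3, 4, 5],
    [3, 5, 1, 2, 4],
    [2, 1, 4, 5, 3],
    [5, 4, 2, 3, 1],
    [4, 3, 5, 1, 2],
]


def is_latin(square):
    """Every symbol 1..n occurs once in each row and once in each column."""
    n = len(square)
    symbols = set(range(1, n + 1))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == symbols for j in range(n))
    return rows_ok and cols_ok


def is_transversal(square, cells):
    """cells is a list of (row, column) pairs; one per row, column and letter."""
    n = len(square)
    rows = {i for i, _ in cells}
    cols = {j for _, j in cells}
    letters = {square[i][j] for i, j in cells}
    return len(rows) == len(cols) == len(letters) == n


def rows_to_diagonal(square, cells):
    """Move row i to row j when the transversal cell of row i lies in column j."""
    out = [None] * len(square)
    for i, j in cells:
        out[j] = list(square[i])
    return out


def relabel_diagonal(square):
    """Rename the letters so that the main diagonal reads 1, 2, ..., n."""
    rename = {square[i][i]: i + 1 for i in range(len(square))}
    return [[rename[x] for x in row] for row in square]


if __name__ == "__main__":
    order = [2, 0, 4, 1, 3]                       # invented row order, illustration only
    scrambled = [TABLE_3A[k] for k in order]
    cells = [(row, order[row]) for row in range(5)]
    assert is_latin(scrambled) and is_transversal(scrambled, cells)
    restored = rows_to_diagonal(scrambled, cells)
    for row in relabel_diagonal(restored):
        print(*row)
```

Applied to Table 3a, the relabelling step reproduces Table 3b, which is a convenient check on the entries of both tables.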

An $n \times n$ square array is drawn. The diagonal cells are left blank, and the $t$
treatments are allocated to the remaining cells, as in Table 4. In this example we
have labelled the treatments A, B, ..., T, but we shall usually use $\omega$ to denote a
general treatment, to avoid confusion with other symbols. We denote the $n$
diagonal cells by $i, j, \ldots$ and the $r$ classifications (that is, rows, columns, letters
of $A_1$, ..., letters of $A_{r-2}$) by $a, b, \ldots$.

We define subsets of the treatments called spokes and fans. A 1-spoke is the
set of $n - 1$ treatments in any row; a 2-spoke is the set of $n - 1$ treatments in
any column. For $a = 3, \ldots, r$, an a-spoke is the set of $n - 1$ treatments in the
positions of any one letter of square $A_{a-2}$. For $a = 1, \ldots, r$ and $i = 1, \ldots, n$ we
denote by $\mathcal{S}_{ai}$ the unique a-spoke which would naturally go through the ith
diagonal cell if the diagonal cells were not excluded. For each fixed $i$, the fan $\mathcal{F}_i$
through the ith diagonal cell is defined to be the union of all spokes through that
diagonal cell; that is,
$$\mathcal{F}_i = \mathcal{S}_{1i} \cup \mathcal{S}_{2i} \cup \cdots \cup \mathcal{S}_{ri}.$$

TABLE 4
Treatment array for Example 1

* A B C D
E * F G H
I J * K L
M N O * P
Q R S T *

TABLE 5
Rectangular lattice block design (Example 1)
(blocks are columns)

The terminology is suggested by the fact that all spokes in a fan have the
corresponding diagonal cell in common, while no two spokes in the same fan have
any further cells in common. In the example given by Tables 3b and 4, we have

$$\mathcal{S}_{11} = \{A, B, C, D\},$$
$$\mathcal{S}_{24} = \{C, G, K, T\},$$
$$\mathcal{S}_{32} = \{D, K, M, S\},$$
$$\mathcal{F}_5 = \{Q, R, S, T, D, H, L, P, A, G, I, O\}.$$


The design is now constructed very easily. For a = 1,..., r, the blocks of the 
ath replicate are just the a-spokes. Table 5 shows the (unrandomized) design 
which emerges in this way from Tables 3b and 4. Thus spokes have a genuine 
statistical meaning, as each spoke gives a block of the design. Fans have no direct 
statistical meaning, but they are a combinatorial consequence of the spokes 
which prove useful for the analysis of the design. 
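The construction can be sketched directly from Tables 3b and 4. The code below is an illustration rather than the authors' procedure: the function names are hypothetical, and the order in which blocks and treatments are listed is arbitrary (the design is unrandomized).

```python
from itertools import combinations

# Table 3b: the single Latin square A_1 needed when r = 3.
SQUARES = [[
    [1, 5, 4, 3, 2],
    [4, 2, 1, 5, 3],
    [5, 1, 3, 2, 4],
    [2, 3, 5, 4, 1],
    [3, 4, 2, 1, 5],
]]
# Table 4: treatment array with blank diagonal cells marked "*".
ARRAY = ["*ABCD", "E*FGH", "IJ*KL", "MNO*P", "QRST*"]


def spoke(a, i, squares=SQUARES, array=ARRAY):
    """Treatments in the a-spoke through the ith diagonal cell (a and i are 1-based)."""
    n = len(array)
    k = i - 1
    cells = [(r, c) for r in range(n) for c in range(n) if r != c]
    if a == 1:                                   # rows
        chosen = [(r, c) for r, c in cells if r == k]
    elif a == 2:                                 # columns
        chosen = [(r, c) for r, c in cells if c == k]
    else:                                        # letters of square A_{a-2}
        sq = squares[a - 3]
        chosen = [(r, c) for r, c in cells if sq[r][c] == sq[k][k]]
    return {array[r][c] for r, c in chosen}


def fan(i, r=3):
    """Union of the r spokes through the ith diagonal cell."""
    return set().union(*(spoke(a, i) for a in range(1, r + 1)))


def design(r=3, n=5):
    """Replicate a consists of the n a-spokes, each spoke being one block."""
    return {a: [sorted(spoke(a, i)) for i in range(1, n + 1)] for a in range(1, r + 1)}


def max_concurrence(blocks):
    """Largest number of blocks shared by any pair of treatments."""
    counts = {}
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return max(counts.values())


if __name__ == "__main__":
    for a, blocks in design().items():
        print("replicate", a, ["".join(b) for b in blocks])
```

Checking that `max_concurrence` equals 1 over all fifteen blocks confirms the defining property that any pair of treatments occurs together in at most one block.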

Orthogonal cyclic Latin squares may be constructed by the automorphism
method of Mann (1942), which is described in Section 7.2 of Dénes and Keedwell
(1974). If $p$ is the smallest prime divisor of $n$ then $p - 1$ orthogonal squares are
obtained, and hence rectangular lattice designs may be constructed for $r \le p$
(reserving one of the squares for the transversal). The same designs may also be
constructed as α-designs [Patterson and Williams (1976)]. Let $q_1, q_2, \ldots, q_{r-1}$ be
any integers such that no two are congruent modulo $p$ and none is divisible by $p$.
Without loss of generality we may take $q_1 = 1$. The generating α-array is in
Table 6, in the format used by Patterson and Williams (1976), whose series I, II,
and IV are all examples of the array shown here.
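The count of $p - 1$ orthogonal cyclic squares can be illustrated with the simplest cyclic family, in which square $q$ has entry $i + qj$ (mod $n$) in row $i$ and column $j$ for $q = 1, \ldots, p - 1$. This is a sketch under that assumption; it is not claimed to reproduce Mann's automorphism method in full, and the function names are hypothetical.

```python
def smallest_prime_divisor(n):
    """Smallest prime dividing n, for n >= 2."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


def cyclic_square(n, q):
    """n x n array with entry (i + q*j) mod n; Latin when q is coprime to n."""
    return [[(i + q * j) % n for j in range(n)] for i in range(n)]


def is_latin(square):
    n = len(square)
    symbols = set(range(n))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == symbols for j in range(n))
    return rows_ok and cols_ok


def are_orthogonal(a, b):
    """Superimposing a and b gives every ordered pair of symbols exactly once."""
    n = len(a)
    pairs = {(a[i][j], b[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n


def cyclic_mols(n):
    """The p - 1 cyclic squares with multipliers 1, ..., p - 1, p the smallest prime divisor of n."""
    p = smallest_prime_divisor(n)
    return [cyclic_square(n, q) for q in range(1, p)]


if __name__ == "__main__":
    for n in (5, 9, 15):
        squares = cyclic_mols(n)
        ok = all(are_orthogonal(a, b)
                 for k, a in enumerate(squares) for b in squares[k + 1:])
        print(n, len(squares), ok)
```

The multipliers and their pairwise differences are all smaller than $p$, hence coprime to $n$, which is why each array is Latin and each pair is orthogonal.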


3. Decomposition of the treatment space. Let $\mathbb{R}^t$ be the real vector space
of vectors indexed by the $t$ treatments. We need to find an orthogonal decomposition
of $\mathbb{R}^t$ that will enable us to analyse data from experiments with the
rectangular lattice design. To this end, we define certain special vectors in and
subspaces of $\mathbb{R}^t$.

Let $u$ be the vector (1, 1, ..., 1). For $a = 1, \ldots, r$ and $i = 1, \ldots, n$ let $v_{ai}$ be the
characteristic vector of the spoke $\mathcal{S}_{ai}$; that is, the $\omega$-entry $(v_{ai})_\omega$ of $v_{ai}$ is 1 if
$\omega \in \mathcal{S}_{ai}$ and 0 otherwise. Similarly, for $i = 1, \ldots, n$, let $w_i$ be the characteristic
vector of the fan $\mathcal{F}_i$, so that

$$w_i = v_{1i} + v_{2i} + \cdots + v_{ri}.$$

TABLE 6
Generators for α-designs which are also rectangular lattice designs
(entries in the array should be reduced modulo n)

0    0      0           ...   0
0    1      q_2         ...   q_{r-1}
0    2      2q_2        ...   2q_{r-1}
...
0    n-2    (n-2)q_2    ...   (n-2)q_{r-1}
0    n-1    (n-1)q_2    ...   (n-1)q_{r-1}


Let U, be the subspace spanned by u; let U, be the subspace spanned by the fan 
vectors w, let U, be the subspace spanned by the spoke vectors v,,,; and let U, be 
the whole space R’. [Our conventions for labelling the first and last of these 
spaces agree with those used by Throckmorton (1961) and Kempthorne (1982).] 
Then 

U, & U, € U, & U, 


For Example 1 we display each vector in R® in a two-dimensional array 
corresponding to Table 4. Tables 7a and 7b give examples of vectors in U,\ U, 
and in U, respectively. 

The dimension of U, is 1. The space R‘ has an inner product (,) on it defined 
by 


$$\langle z, z' \rangle = \sum_{\omega=1}^{t} z_\omega z'_\omega.$$


TABLE 7a
The vector $v_{11} - 2v_{24} + 5v_{32}$

*   1   1  -1   6
0   *   0  -2   0
0   0   *   3   0
5   0   0   *   0
0   0   5  -2   *

TABLE 7b
The vector $w_1 + 3w_5$

*   4   1   1   4
1   *   1   3   3
4   1   *   0   3
1   0   3   *   4
4   3   3   4   *




We use this to find the dimensions of the spaces U, and U,. Note that 
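$$\langle v_{ai}, v_{bj} \rangle = |\mathcal{S}_{ai} \cap \mathcal{S}_{bj}| =
\begin{cases}
n - 1 & \text{if } a = b \text{ and } i = j, \\
0 & \text{if } a = b \text{ and } i \ne j, \\
0 & \text{if } a \ne b \text{ and } i = j, \\
1 & \text{if } a \ne b \text{ and } i \ne j,
\end{cases} \tag{3.1}$$

so that

$$\langle w_i, w_j \rangle = |\mathcal{F}_i \cap \mathcal{F}_j| =
\begin{cases}
r(n - 1) & \text{if } i = j, \\
r(r - 1) & \text{if } i \ne j.
\end{cases} \tag{3.2}$$

Moreover, $\sum_i w_i = ru$. Suppose that $\sum_i \lambda_i w_i = 0$ for some real numbers $\lambda_i$. If
$r \ne n$, taking inner products with individual $w_j$ shows that $\lambda_1 = \cdots = \lambda_n$, and
hence that $\lambda_1 = \cdots = \lambda_n = 0$: thus the fan vectors are linearly independent and
so the subspace they span has dimension $n$. On the other hand, if $r = n$ then $w_i = u$ for
$i = 1, \ldots, n$: thus the subspace spanned by the fan vectors is the subspace spanned by $u$.
Now suppose that $\sum_a \sum_i \lambda_{ai} v_{ai} = 0$ for some real numbers $\lambda_{ai}$. Taking
inner products with individual $v_{bj}$ shows that there are real numbers $\theta_a$ and $\phi_i$
such that $\lambda_{ai} = \theta_a + \phi_i$ for all $a$ and $i$. Since
$$v_{a1} + v_{a2} + \cdots + v_{an} = u$$

for $a = 1, \ldots, r$, this implies that $(\sum_a \theta_a)u + \sum_i \phi_i w_i = 0$. Hence the subspace
spanned by the spoke vectors has dimension $nr - (r - 1)$ if $r \ne n$, and
$nr - (r - 1) - (n - 1)$ if $r = n$.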

For Example 1, Equations (3.1) and (3.2) are demonstrated in Tables 7a and
7b, respectively. For example, the six entries equal to 4 in Table 7b correspond to
the elements of $\mathcal{F}_1 \cap \mathcal{F}_5$. In this case the five fan vectors are linearly independent
and so form a basis for the subspace they span; while a basis of the subspace spanned by
the spoke vectors consists of $u$ and all but three spoke vectors, one being omitted
for each classification.
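The inner products (3.1) and (3.2), the two dimensions, and the entries of Tables 7a and 7b can all be checked by direct computation for Example 1. The sketch below assumes the arrays of Tables 3b and 4; the helper names are hypothetical, and the rank is computed by exact Gaussian elimination over the rationals.

```python
from fractions import Fraction

LETTERS = [  # Table 3b
    [1, 5, 4, 3, 2],
    [4, 2, 1, 5, 3],
    [5, 1, 3, 2, 4],
    [2, 3, 5, 4, 1],
    [3, 4, 2, 1, 5],
]
ARRAY = ["*ABCD", "E*FGH", "IJ*KL", "MNO*P", "QRST*"]  # Table 4
N, R = 5, 3
TREATMENTS = [c for row in ARRAY for c in row if c != "*"]


def spoke(a, i):
    """Treatments in the a-spoke through the ith diagonal cell (1-based)."""
    k = i - 1
    cells = [(r, c) for r in range(N) for c in range(N) if r != c]
    if a == 1:
        chosen = [(r, c) for r, c in cells if r == k]
    elif a == 2:
        chosen = [(r, c) for r, c in cells if c == k]
    else:
        chosen = [(r, c) for r, c in cells if LETTERS[r][c] == LETTERS[k][k]]
    return {ARRAY[r][c] for r, c in chosen}


def indicator(subset):
    return [1 if t in subset else 0 for t in TREATMENTS]


def dot(x, y):
    return sum(p * q for p, q in zip(x, y))


def combine(terms):
    """Linear combination of vectors given as (coefficient, vector) pairs."""
    return [sum(c * vec[k] for c, vec in terms) for k in range(len(TREATMENTS))]


def rank(rows):
    """Rank by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(len(m)):
            if r != rk and m[r][col] != 0:
                f = m[r][col] / m[rk][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[rk])]
        rk += 1
    return rk


def as_array(vec):
    """Lay a vector out like Table 4, with None in the diagonal cells."""
    values = dict(zip(TREATMENTS, vec))
    return [[values.get(c) for c in row] for row in ARRAY]


V = {(a, i): indicator(spoke(a, i)) for a in range(1, R + 1) for i in range(1, N + 1)}
W = {i: combine([(1, V[(a, i)]) for a in range(1, R + 1)]) for i in range(1, N + 1)}

if __name__ == "__main__":
    print("rank of fan vectors:", rank(list(W.values())))
    print("rank of spoke vectors:", rank(list(V.values())))
    for row in as_array(combine([(1, W[1]), (3, W[5])])):
        print(row)
```

For Example 1 the two ranks should come out as $n = 5$ and $nr - (r - 1) = 13$, matching the dimension count in the text.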

We can form the orthogonal complements of the U-subspaces, and thus obtain 
the subspaces that really interest us. Specifically, we put 


y= U, 

V, = the orthogonal complement of U, in U,, 
V, = the orthogonal complement of U, in U,, 
V, = the orthogonal complement of U, in U,. 


Then V, is spanned by vectors of the form w, — w,; while V, is spanned by 
vectors of the form v,, — Vp Now R* is the orthogonal direct sum 


R'=V.@V,0 V, ® V, 


We record the important facts about this decomposition in Table 8. 

In two special cases this decomposition can be described in simpler terms. If $r = n$ then the set $\{A_1, \ldots, A_{n-2}\}$ is only one square short of a complete set of mutually orthogonal Latin squares. Thus there exists a (unique) Latin square






TABLE 8
Decomposition of the treatment subspace

subspace              $V_0$   $V_F$          $V_S$             $V_E$
description           mean    contrasts      contrasts         orthogonal
                              between fans   between spokes    to spokes
                                             within fans
dimension ($r < n$)   1       $n-1$          $(n-1)(r-1)$      $(n-r)(n-1)-1$
dimension ($r = n$)   1       0              $(n-1)^2$         $n-2$


$A_{n-1}$ orthogonal to all the others, by Theorem 1.6.1 of Raghavarao (1971). One letter of $A_{n-1}$ must correspond to the transversal. Each other letter of $A_{n-1}$ occurs just once in each $a$-spoke, for each classification $a$. Hence the contrasts between these $n - 1$ other letters are orthogonal to spokes, and so they form the whole space $V_E$. Since $V_F$ is null in this case, $V_S$ must consist of all treatment contrasts which are orthogonal to the letters of $A_{n-1}$. Thus the treatments have the simple nested structure $(n - 1) \to n$ [in the notation of Nelder (1965a)], and the treatment space decomposition is the familiar one into mean, between letters of $A_{n-1}$ and within letters.

If $r = n - 1$ and $n \ne 4$, the results of Shrikhande (1961) and Bruck (1963) show that there is a unique complete orthogonal set $\{A_1, \ldots, A_{n-1}\}$ containing the original set $\{A_1, \ldots, A_{n-3}\}$ and that the original transversal corresponds to a letter of one of the two extra squares, say $A_{n-2}$. The same result is true even when $n = 4$, because the existence of the original transversal prevents $A_1$ from being isotopic to the square in Table 2, which is the only $4 \times 4$ Latin square (up to isotopy) which is not uniquely embeddable in a complete set of mutually orthogonal Latin squares [isotopy classes are also called transformation sets (see Fisher and Yates (1934))]. The treatments now have the simple crossed factorial structure $Q_1 \times Q_2$, where the levels of $Q_1$ are the $n - 1$ other letters of $A_{n-2}$ and the levels of $Q_2$ are the $n$ letters of $A_{n-1}$. Now $V_E$ is the main effect of $Q_1$; while $V_F$ is the main effect of $Q_2$ and $V_S$ is the $Q_1Q_2$ interaction.

Example 2 has $r = n - 1 = 3$. The rectangular lattice design is constructed from the set of mutually orthogonal $4 \times 4$ Latin squares in Table 9: the rows, columns, and letters of $A_1$ are the three classifications; letter 1 of $A_2$ gives the transversal; the remaining letters of $A_2$ and $A_3$ give the $3 \times 4$ factorial treatment structure described above and shown in Table 10. The design is that shown in Table 11, ignoring periods.

In both these special cases the factorial treatment decomposition has no direct statistical meaning, but is merely an aid to the analysis. The factors $Q_1$ and $Q_2$ are entirely analogous to the pseudo-factors used in the construction and analysis of square lattice designs [Yates (1936)].


4. Treatment projection. Let $z$ be a vector in $\mathbb{R}^t$. In order to use the spaces $V_0$, $V_F$, $V_S$ and $V_E$ in the analysis of an experiment we need to know how to calculate the projections of $z$ onto these spaces. This is done in terms of the




TABLE 9a
Three mutually orthogonal 4 × 4 Latin squares

$A_2$ ('1' gives transversal; other letters are levels of $Q_1$)
$A_3$ (letters are levels of $Q_2$)
$A_1$ (gives 3rd replicate)


TABLE 9b 
Array of twelve treatments for Example 2 
* A B C 
D * E F 
G H * I 
J K L * 
TABLE 10
3 × 4 factorial structure for Example 2

level of $Q_1$   2  3  4  2  4  3  3  4  2  4  3  2
treatment        A  B  C  D  E  F  G  H  I  J  K  L
level of $Q_2$   2  3  4  3  1  2  4  3  1  2  1  4





TABLE 11
Design which is not generally balanced

room               1                   3
sheep         1   2   3   4       9  10  11  12
period 1      B   D   I   L       A   J   C   H
period 2      C   E   H   K       G   B   D   F
period 3      A   F   G   J       E   I   K   L
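following totals:

grand total   $G(z) = \sum_{\omega} z_\omega$,

spoke total   $S_{ai}(z) = \sum\{z_\omega : \omega \in \mathcal{S}_{ai}\} = \langle z, v_{ai}\rangle$,

fan total     $F_i(z) = \sum\{z_\omega : \omega \in \mathcal{F}_i\} = \langle z, w_i\rangle$.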




TABLE 12
A particular vector $z$ in $\mathbb{R}^t$

 *   7   3   2   3
 6   *   5   9   4
 5   2   *   6   7
 4   5   8   *   1
 2   4   2   5   *
It is immediate that

\[ \sum_i S_{ai}(z) = G(z), \tag{4.1} \]
\[ \sum_a S_{ai}(z) = F_i(z), \tag{4.2} \]
\[ \sum_i F_i(z) = rG(z). \tag{4.3} \]

Define the fan totals vector $f(z)$ and the spoke totals vector $s(z)$ by

\[ f(z) = \sum_i F_i(z)\, w_i, \]
\[ s(z) = \sum_a \sum_i S_{ai}(z)\, v_{ai}. \]

We also need the grand totals vector $g(z)$, all of whose entries are equal to $G(z)$.
Continuing our Example 1, a vector $z$ is shown in Table 12. Its spoke totals are in Table 13: the column margins are the fan totals, and the row totals are all the grand total. The vectors $f(z)$ and $s(z)$ are shown in Table 14.
We aim to give the projections of $z$ onto the spaces $V_0$, $V_F$, $V_S$, and $V_E$ in terms of $f(z)$, $s(z)$, and $g(z)$. The necessary calculations are contained in the following two lemmas.


LEMMA 1.
(i) $\langle s(z), v_{ai}\rangle = nS_{ai}(z) + (r-1)G(z) - F_i(z)$,
(ii) $\langle f(z), v_{ai}\rangle = (n-r)F_i(z) + r(r-1)G(z)$,
(iii) $\langle f(z), w_i\rangle = r(n-r)F_i(z) + r^2(r-1)G(z)$.
TABLE 13
Spoke totals of $z$

$i$                  1    2    3    4    5   total

row ($a = 1$)       15   24   20   18   13    90
column ($a = 2$)    17   18   18   22   15    90
letter ($a = 3$)    13   15   13   20   29    90
fan totals          45   57   51   60   57   270




TABLE 14 
fan totals vector f(z) spoke totals vector s(z) 

* 159 156 156 159 * 62 53 50 45 
162 * 153 174 165 61 * 55 75 52 
153 153 * 168 168 66 51 * 57 55 
162 168 168 * 162 50 49 65 * 46 
153 174 165 162 * 43 51 46 48 * 





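PROOF. To simplify the expressions, we omit "$(z)$", the vector $z$ being understood.

(i)
\begin{align*}
\langle s, v_{ai}\rangle &= \sum_b \sum_j S_{bj} \langle v_{bj}, v_{ai}\rangle \\
&= (n-1)S_{ai} + \sum_{b \ne a} \sum_{j \ne i} S_{bj} && \text{(by (3.1))} \\
&= (n-1)S_{ai} + \sum_{b \ne a} (G - S_{bi}) && \text{(by (4.1))} \\
&= nS_{ai} + (r-1)G - \sum_b S_{bi} \\
&= nS_{ai} + (r-1)G - F_i && \text{(by (4.2))}.
\end{align*}

(ii)
\begin{align*}
\langle f, v_{ai}\rangle &= \sum_j F_j \langle w_j, v_{ai}\rangle \\
&= (n-1)F_i + (r-1)\sum_{j \ne i} F_j && \text{(by (3.1))} \\
&= (n-r)F_i + (r-1)\sum_j F_j \\
&= (n-r)F_i + r(r-1)G && \text{(by (4.3))}.
\end{align*}

(iii) Summing the equation in (ii) over all the spokes in $\mathcal{F}_i$ gives

\[ \langle f, w_i\rangle = r(n-r)F_i + r^2(r-1)G. \qquad \square \]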
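The three identities of Lemma 1 are easy to check numerically for a random vector $z$. The sketch below again assumes a hypothetical cyclic Latin square rather than the square of Example 1; the seed and the variable names are illustrative only.

```python
import numpy as np

n, r = 5, 3
cells = [(i, j) for i in range(n) for j in range(n) if i != j]

# Hypothetical classifications (row, column, cyclic letter); see the comments
# on the transversal: letter class k matches the letter of cell (k, k).
half = pow(2, -1, n)
index = np.array([[i, j, (i + j) * half % n] for i, j in cells])
v = np.array([[(index[:, a] == i).astype(float) for i in range(n)]
              for a in range(r)])          # v[a, i] = spoke vector v_ai
w = v.sum(axis=0)                          # w[i]    = fan vector w_i

rng = np.random.default_rng(0)
z = rng.integers(0, 10, size=len(cells)).astype(float)

S = v @ z                                  # spoke totals S_ai(z), shape (r, n)
F = w @ z                                  # fan totals F_i(z), shape (n,)
G = z.sum()                                # grand total G(z)
s = np.einsum("ai,ait->t", S, v)           # spoke totals vector s(z)
f = F @ w                                  # fan totals vector f(z)

lemma_i = np.allclose(v @ s, n * S + (r - 1) * G - F)
lemma_ii = np.allclose(v @ f, (n - r) * F + r * (r - 1) * G)
lemma_iii = np.allclose(w @ f, r * (n - r) * F + r**2 * (r - 1) * G)
print(lemma_i, lemma_ii, lemma_iii)
```

The right-hand sides rely on NumPy broadcasting: `F` has one entry per fan and is reused for every classification $a$.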


LEMMA 2. The orthogonal projections of $z$ onto $U_0$, $U_F$, $U_S$ are, respectively,

\[
\frac{g(z)}{n(n-1)}, \qquad
\frac{f(z)}{r(n-r)} - \frac{(r-1)\,g(z)}{(n-1)(n-r)}, \qquad
\frac{s(z)}{n} + \frac{f(z)}{n(n-r)} - \frac{(r-1)\,g(z)}{(n-1)(n-r)},
\]

when $r \ne n$. When $r = n$ then $U_F = U_0$ and the orthogonal projection of $z$ onto $U_S$ is

\[ \frac{s(z)}{n} - \frac{(n-2)\,g(z)}{n(n-1)}. \]




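PROOF. Put $x = [r(n-r)]^{-1}f - (r-1)[(n-1)(n-r)]^{-1}g$ when $r \ne n$. Since $f$ and $g$ are both sums of fan vectors, $x \in U_F$. Thus it suffices to show that $z - x$ is orthogonal to $U_F$. This is so if $\langle z - x, w_i\rangle = 0$ for each fan $\mathcal{F}_i$. By Lemma 1(iii) and (3.2),

\[
\langle x, w_i\rangle = \frac{r(n-r)F_i + r^2(r-1)G}{r(n-r)} - \frac{r(n-1)(r-1)G}{(n-1)(n-r)} = F_i = \langle z, w_i\rangle.
\]

Similarly, put $y = n^{-1}s + [n(n-r)]^{-1}f - (r-1)[(n-1)(n-r)]^{-1}g$. Then $y \in U_S$, because $s$, $f$, and $g$ are all sums of spoke vectors, so it suffices to show that $\langle z - y, v_{ai}\rangle = 0$ for all spokes $\mathcal{S}_{ai}$. Lemmas 1(i) and (ii) show that $\langle y, v_{ai}\rangle$ is equal to

\[
\frac{nS_{ai} + (r-1)G - F_i}{n} + \frac{(n-r)F_i + r(r-1)G}{n(n-r)} - \frac{(n-1)(r-1)G}{(n-1)(n-r)},
\]

which is $S_{ai}$, which is $\langle z, v_{ai}\rangle$.
Now let $r = n$ and put $y = n^{-1}s - (n-2)[n(n-1)]^{-1}g$. Since $F_i = G$ when $r = n$, Lemma 1(i) gives

\[
\langle y, v_{ai}\rangle = \frac{nS_{ai} + (n-1)G - F_i}{n} - \frac{(n-2)(n-1)G}{n(n-1)} = S_{ai} = \langle z, v_{ai}\rangle,
\]

so that $y \in U_S$ and $z - y$ is orthogonal to $U_S$. $\square$
Now subtraction gives the orthogonal projection of $z$ onto $V_0$, $V_F$, $V_S$, $V_E$.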


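THEOREM 1. Let $T_0, T_F, T_S, T_E$ be the operators of orthogonal projection from $\mathbb{R}^t$ onto $V_0, V_F, V_S, V_E$, respectively. Then, for all $z$ in $\mathbb{R}^t$,

\[ T_0 z = \frac{g(z)}{n(n-1)}, \]
\[ T_F z = \frac{f(z)}{r(n-r)} - \frac{r\,g(z)}{n(n-r)} \quad \text{when } r \ne n \text{ and zero otherwise}, \]
\[ T_S z = \frac{s(z)}{n} - \frac{f(z)}{rn}, \]
\[ T_E z = z - (T_0 z + T_F z + T_S z). \]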


In Example 1 we have $n = 5$ and $r = 3$, so $T_0 z = g(z)/20$; $T_F z = f(z)/6 - 3g(z)/10$; $T_S z = s(z)/5 - f(z)/15$, and $T_E z$ is best obtained by subtraction. For the particular vector $z$ shown in Table 12, these four components of $z$ are shown in Table 15. The orthogonality of the decomposition may be verified by noting that

\[
\|T_0 z\|^2 + \|T_F z\|^2 + \|T_S z\|^2 + \|T_E z\|^2 = 405 + 24 + 47.2 + 21.8 = 498 = \|z\|^2.
\]
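The arithmetic of this example can be reproduced with a short script. This is a sketch, not the authors' computation: the letter square below is an assumption, reconstructed so that it agrees with Tables 7a, 7b, 13 and 14, and the entry 7 in the first row of $z$ is the value implied by the row total 15 in Table 13 and by $\|z\|^2 = 498$.

```python
import numpy as np

n, r = 5, 3
# Assumed letter square (diagonal = transversal, not used as treatments).
letters = np.array([[1, 5, 4, 3, 2],
                    [4, 2, 1, 5, 3],
                    [5, 1, 3, 2, 4],
                    [2, 3, 5, 4, 1],
                    [3, 4, 2, 1, 5]])
# Vector z of Table 12, with 0 as a placeholder on the unused diagonal.
z_sq = np.array([[0, 7, 3, 2, 3],
                 [6, 0, 5, 9, 4],
                 [5, 2, 0, 6, 7],
                 [4, 5, 8, 0, 1],
                 [2, 4, 2, 5, 0]], dtype=float)

off = ~np.eye(n, dtype=bool)               # off-diagonal cells = treatments
rows, cols = np.nonzero(off)
index = np.stack([rows, cols, letters[off] - 1], axis=1)   # 0-based spoke labels
z = z_sq[off]

v = np.array([[(index[:, a] == i).astype(float) for i in range(n)]
              for a in range(r)])          # spoke vectors v_ai
w = v.sum(axis=0)                          # fan vectors w_i
S, F, G = v @ z, w @ z, z.sum()            # spoke, fan and grand totals
s = np.einsum("ai,ait->t", S, v)           # spoke totals vector
f = F @ w                                  # fan totals vector
g = np.full_like(z, G)                     # grand totals vector

# Projections from Theorem 1.
T0 = g / (n * (n - 1))
TF = f / (r * (n - r)) - r * g / (n * (n - r))
TS = s / n - f / (r * n)
TE = z - (T0 + TF + TS)
norms = [float(x @ x) for x in (T0, TF, TS, TE)]
print(np.round(norms, 1), float(z @ z))
```

The squared norms should reproduce the four terms 405, 24, 47.2 and 21.8 quoted above, and `S` and `F` should match the spoke and fan totals of Table 13.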




TABLE 15 








5. General balance. The block structure of a rectangular lattice design is 
the double nested classification of plots within blocks within replicates. This is 
one of the simple orthogonal block structures defined by Nelder (1965a). In what 
follows we retain the notation of Nelder (1965a, b, 1968) and Bailey (1981) as far 
as possible. 

Let $\mathbb{R}^N$ be the real vector space associated with the $N$ plots. Each grouping of the plots according to the block structure defines an averaging operation $P$ on $\mathbb{R}^N$. In our case there are four averaging operators: the grand mean averaging operator $P_0 = J/N$, where $J$ is the all-1's matrix; the replicates averaging operator $P_R$; the blocks averaging operator $P_B$; and the identity $P_E = I$. Nelder (1965a) showed that there is an orthogonal direct sum decomposition $\bigoplus_\alpha W_\alpha$ of $\mathbb{R}^N$ such that each $W_\alpha$ is an eigenspace of every $P$. Let $C_\alpha$ be the operator of orthogonal projection from $\mathbb{R}^N$ onto $W_\alpha$. Nelder (1965a) showed that each $C_\alpha$ is a linear combination of the $P$'s with integer coefficients: Speed and Bailey (1982) gave explicit formulae for these coefficients. In our case we have

\[ C_0 = P_0, \qquad C_R = P_R - P_0, \]
\[ C_B = P_B - P_R, \qquad C_E = P_E - P_B. \]

The spaces $W_\alpha$ are called strata: they play an important role in analysis of variance [see Nelder (1965b) and Bailey (1981)]. Our covariance model for the data vector $y$ is

\[ \operatorname{Cov}(y) = \xi_0 C_0 + \xi_R C_R + \xi_B C_B + \xi_E C_E \tag{5.1} \]

for unknown scalars $\xi_0$, $\xi_R$, $\xi_B$, and $\xi_E$.

Denote by $X$ the $N \times t$ design matrix; that is, $X_{p\omega}$ is 1 if plot $p$ receives treatment $\omega$ and 0 otherwise. For each stratum $W_\alpha$, the matrix $L_\alpha$ defined by $L_\alpha = X'C_\alpha X$ is called the information matrix for that stratum. For designs with equal replication $r$, we have $L_0 = rT_0$. If $L_\alpha = 0$ there is no information about treatments in stratum $W_\alpha$. Strata, other than $W_0$, for which $L_\alpha \ne 0$, are called effective strata.

Suppose that $\bigoplus_\theta V_\theta$ is an orthogonal direct sum decomposition of $\mathbb{R}^t$. Nelder (1965b) defined an equally replicated design to be generally balanced with respect to this treatment decomposition if each $V_\theta$ is an eigenspace of every information matrix; that is, there are numbers $\lambda_{\alpha\theta}$ such that $L_\alpha = \sum_\theta \lambda_{\alpha\theta} T_\theta$, where $T_\theta$ denotes orthogonal projection onto $V_\theta$. We have $0 \le \lambda_{\alpha\theta} \le r$ for all $\alpha$ and $\theta$; and $\sum_\alpha \lambda_{\alpha\theta} = r$ for all $\theta$. The quantity $\lambda_{\alpha\theta}/r$ is the efficiency factor for treatment term $V_\theta$ in the stratum $W_\alpha$. In a simple block design with blocks stratum $W_B$, examination of the trace of $L_B$ shows that $\sum_\theta \lambda_{B\theta} \dim(V_\theta)/r = b/r - 1$, the so-called loss of information due to blocks.

Houtman and Speed (1983) have shown that in any design with only two effective strata there must be some decomposition $\bigoplus_\theta V_\theta$ of $\mathbb{R}^t$ with respect to which the design is generally balanced. However, the decomposition may not be easy to find, use or interpret. Our claim is that a rectangular lattice design is generally balanced with respect to the treatment decomposition given in Section 3.


LEMMA 3. For $a = 1, \ldots, r$ and $i = 1, \ldots, n$,

\[ X'P_B X v_{ai} = \bigl(n v_{ai} - w_i + (r-1)u\bigr)/(n-1). \]


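PROOF. If $\mathcal{B}$ is any block and $v$ is any vector in $\mathbb{R}^t$ then the entries of $P_B X v$ for the plots in $\mathcal{B}$ are all equal to the average of the entries of $v$ for those treatments which occur in $\mathcal{B}$. If $v = v_{ai}$ and $\mathcal{B}$ consists of $\mathcal{S}_{bj}$ then this average is equal to $\langle v_{ai}, v_{bj}\rangle/(n-1)$. Denote the characteristic vector of this block by $x_{bj}$. Then

\[ (n-1) P_B X v_{ai} = \sum_b \sum_j \langle v_{ai}, v_{bj}\rangle\, x_{bj}. \]

Since $X'x_{bj} = v_{bj}$ we have

\begin{align*}
(n-1) X'P_B X v_{ai} &= \sum_b \sum_j \langle v_{ai}, v_{bj}\rangle\, v_{bj} \\
&= (n-1)v_{ai} + \sum_{b \ne a} (u - v_{bi}) && \text{(by (3.1))} \\
&= n v_{ai} + (r-1)u - w_i. \qquad \square
\end{align*}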


THEOREM 2. Rectangular lattice designs are generally balanced with respect 
to the treatment decomposition given in Section 3. 


PROOF. We always have $L_0 u = ru$, and $L_0 z = 0$ whenever $z$ is orthogonal to $u$. By definition of replicate, $X'P_R X z = r g(z)/n(n-1) = X'P_0 X z$, so $L_R = 0$. Moreover, $L_B = X'P_B X - X'P_R X$, and so

\[ L_B(v_{ai} - v_{bi}) = n(n-1)^{-1}(v_{ai} - v_{bi}) \]

by Lemma 3. Since $V_S$ is spanned by vectors of the form $v_{ai} - v_{bi}$, this shows that










TABLE 16
Efficiency factors of a rectangular lattice design

                             treatment subspace
stratum              $V_0$   $V_F$                      $V_S$                       $V_E$
mean $W_0$             1       0                          0                           0
replicates $W_R$       0       0                          0                           0
blocks $W_B$           0       $\frac{n-r}{r(n-1)}$       $\frac{n}{r(n-1)}$          0
plots $W_E$            0       $\frac{n(r-1)}{r(n-1)}$    $\frac{rn-r-n}{r(n-1)}$     1


$V_S$ is an eigenspace of $L_B$ with eigenvalue $\lambda_{BS} = n/(n-1)$. Similarly, Lemma 3 shows that

\[ L_B(w_i - w_j) = (n-r)(n-1)^{-1}(w_i - w_j), \]

so $V_F$ is an eigenspace of $L_B$ with eigenvalue $\lambda_{BF} = (n-r)/(n-1)$. Whether or not $r = n$, Table 8 now shows that $\lambda_{BS}\dim(V_S) + \lambda_{BF}\dim(V_F) = b - r$, so there can be no further nonzero eigenvalues in the blocks stratum. Thus $V_E$ must be an eigenspace of $L_B$ with $\lambda_{BE} = 0$.

By the result of Houtman and Speed (1983), the spaces $V_F$, $V_S$, $V_E$ are also eigenspaces of $L_E$. $\square$


The eigenvalues in stratum $W_E$ are calculated by subtraction. Division by $r$ gives the efficiency factors, which are shown in Table 16, which is laid out like the table in Section 4.2 of Nelder (1968).
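The eigenvalues behind Table 16 can be checked by building the information matrices directly from a design. The sketch below assumes a hypothetical design with $n = 5$, $r = 3$ based on a cyclic Latin square (not the design of Example 1); the helper `averaging` and the plot ordering are illustrative choices.

```python
import numpy as np

n, r = 5, 3
cells = [(i, j) for i in range(n) for j in range(n) if i != j]
t = len(cells)
half = pow(2, -1, n)
# Spoke label of each treatment in each classification (row, column, letter).
index = np.array([[i, j, (i + j) * half % n] for i, j in cells])

# One plot per (replicate a, treatment); its block within replicate a is the
# a-spoke containing the treatment.
treatment = np.tile(np.arange(t), r)
replicate = np.repeat(np.arange(r), t)
block = replicate * n + index[treatment, replicate]
N = r * t

X = np.zeros((N, t))
X[np.arange(N), treatment] = 1.0            # N x t design matrix

def averaging(groups):
    """Matrix that replaces each entry by the mean over its group."""
    same = (groups[:, None] == groups[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

P_R, P_B = averaging(replicate), averaging(block)
L_B = X.T @ (P_B - P_R) @ X                 # blocks-stratum information matrix
L_E = X.T @ (np.eye(N) - P_B) @ X           # plots-stratum information matrix
eig_B = np.sort(np.linalg.eigvalsh(L_B))
eig_E = np.sort(np.linalg.eigvalsh(L_E))
print(np.round(eig_B, 3))
```

Under these assumptions `eig_B` should contain $(n-r)/(n-1) = 0.5$ with multiplicity $n - 1 = 4$ and $n/(n-1) = 1.25$ with multiplicity $(n-1)(r-1) = 8$, the remaining eigenvalues being zero; dividing by $r$ gives the blocks row of Table 16.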

Block designs are often classified by a single measure of efficiency: the harmonic mean of the efficiency factors (taking account of multiplicity) in stratum $W_E$. It follows from Tables 8 and 16 that, whether $r = n$ or $r < n$, the harmonic mean efficiency factor for a rectangular lattice design is

\[
\frac{n(r-1)(rn-r-n)(n^2-n-1)}{(r-1)^2 n^2 (n^2-n-1) - r^2(n-1)^2 + rn(r-1)}.
\]





This efficiency factor is proportional to the reciprocal of the average variance of 
the intrablock estimates of simple treatment differences, and so may also be 
obtained from this average variance, which is given by Williams (1977, page 413). 
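As a check on the closed form, the harmonic mean can also be computed directly from the dimensions in Table 8 and the plots-stratum efficiency factors in Table 16. This is a sketch in exact rational arithmetic; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction

def plot_stratum_terms(n, r):
    """(dimension, efficiency factor) pairs in the plots stratum."""
    d = r * (n - 1)
    e_S = Fraction(r * n - r - n, d)
    if r == n:                               # V_F is null when r = n (Table 8)
        return [((n - 1) ** 2, e_S), (n - 2, Fraction(1))]
    return [
        (n - 1, Fraction(n * (r - 1), d)),               # V_F
        ((n - 1) * (r - 1), e_S),                        # V_S
        ((n - r) * (n - 1) - 1, Fraction(1)),            # V_E
    ]

def harmonic_mean_efficiency(n, r):
    """Harmonic mean of the efficiency factors, weighted by dimension."""
    terms = plot_stratum_terms(n, r)
    total_dim = sum(dim for dim, _ in terms)             # n(n - 1) - 1
    return total_dim / sum(dim / e for dim, e in terms)

def closed_form(n, r):
    """The displayed formula for the harmonic mean efficiency factor."""
    q = n * n - n - 1
    num = n * (r - 1) * (r * n - r - n) * q
    den = (r - 1) ** 2 * n ** 2 * q - r ** 2 * (n - 1) ** 2 + r * n * (r - 1)
    return Fraction(num, den)

print(harmonic_mean_efficiency(5, 3), closed_form(5, 3))
```

For Example 1 ($n = 5$, $r = 3$) both functions should give $35/47$, roughly 0.745.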


6. Analysis. Since rectangular lattice designs are generally balanced, their analysis follows the pattern described by Nelder (1965b, 1968), Wilkinson (1970), and James and Wilkinson (1971). In this section we specialize their results to rectangular lattice designs, retaining most of Nelder's notation. We outline the procedure for fitting the model, deriving a complete analysis of variance, estimating the stratum variances $\xi_R$, $\xi_B$, and $\xi_E$, and obtaining minimum variance unbiased linear estimates (with estimated weights) of arbitrary treatment contrasts, together with their estimated variances.

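Let $\mathbf{t}$ be the $t \times 1$ vector of individual treatment effects and let $y$ be the $N \times 1$ vector of observations. If $\lambda_{\alpha\theta} \ne 0$, the treatment effect $T_\theta \mathbf{t}$ is estimated in stratum $W_\alpha$ by $h_{\alpha\theta}$, where $h_{\alpha\theta} = T_\theta X'C_\alpha y/\lambda_{\alpha\theta}$. The contribution of treatment term $V_\theta$ to the fitted value in stratum $W_\alpha$ is $C_\alpha X h_{\alpha\theta}$, with the sum of squares $\lambda_{\alpha\theta}\|h_{\alpha\theta}\|^2$. Thus the overall fitted value in stratum $W_\alpha$ is $\sum'_\theta C_\alpha X h_{\alpha\theta}$, where $\sum'_\theta$ denotes summation over those $\theta$ for which $\lambda_{\alpha\theta} \ne 0$. The residual sum of squares, $\mathrm{RSS}_\alpha$, in stratum $W_\alpha$, and its number of degrees of freedom, $d_\alpha$, are obtained by subtraction:

\[ \mathrm{RSS}_\alpha = y'C_\alpha y - \sum\nolimits'_\theta \lambda_{\alpha\theta}\|h_{\alpha\theta}\|^2, \tag{6.1} \]
\[ d_\alpha = \dim(W_\alpha) - \sum\nolimits'_\theta \dim(V_\theta). \tag{6.2} \]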


Thus we obtain the analysis of variance shown in Tables 17a (r < n) and 17b 
(r=n). 

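If the stratum variances $\xi_\alpha$ are known, we put $w_\theta = \sum_\alpha \lambda_{\alpha\theta}/\xi_\alpha$ and define weights $w_{\alpha\theta}$ by $w_{\alpha\theta} = \lambda_{\alpha\theta}/(\xi_\alpha w_\theta)$. The weighted effect corresponding to treatment term $V_\theta$ is $\sum_\alpha w_{\alpha\theta} h_{\alpha\theta}$, and the overall weighted fitted value $\hat{\mathbf{t}}$ is $\sum_\theta \sum_\alpha w_{\alpha\theta} h_{\alpha\theta}$. If $x$ is any treatment contrast (that is, $x \in \mathbb{R}^t$ and $\langle x, u\rangle = 0$) then the minimum variance unbiased linear estimate of $\langle x, \mathbf{t}\rangle$ is $\langle x, \hat{\mathbf{t}}\rangle$, with variance $\sum_\theta \|T_\theta x\|^2/w_\theta$.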
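The weighting rule is ordinary inverse-variance weighting within each treatment term. A minimal sketch follows, using the eigenvalues of $V_S$ for $n = 5$, $r = 3$ and purely hypothetical stratum variances $\xi_B = 4$, $\xi_E = 1$ (these values are assumptions for illustration, not data from the paper); `combine_strata` is an illustrative name.

```python
from fractions import Fraction

def combine_strata(lambdas, xis):
    """Return (w_theta, weights) for one treatment term V_theta.

    lambdas[alpha] is the eigenvalue lambda_{alpha,theta}; xis[alpha] is the
    stratum variance xi_alpha.  Strata with lambda = 0 get weight 0.
    """
    precision = {a: Fraction(lambdas[a]) / Fraction(xis[a]) for a in lambdas}
    w_theta = sum(precision.values())
    weights = {a: p / w_theta for a, p in precision.items()}
    return w_theta, weights

n, r = 5, 3
# Eigenvalues of V_S in the blocks and plots strata (they sum to r).
lam_S = {"B": Fraction(n, n - 1), "E": r - Fraction(n, n - 1)}
xi = {"B": 4, "E": 1}                        # hypothetical stratum variances
w_S, weights_S = combine_strata(lam_S, xi)
variance_factor = 1 / w_S   # variance of <x, t_hat> for a unit vector x in V_S
print(w_S, weights_S, variance_factor)
```

With these assumed variances the blocks stratum receives weight $5/33$ and the plots stratum $28/33$, so most of the information on $V_S$ comes from within blocks.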


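TABLE 17a
Analysis of variance when $r < n$

stratum      source of   df                    SS                               EMS
             variation
mean                     $1$                   $y'C_0y$                         $r\|T_0\mathbf{t}\|^2 + \xi_0$
replicates               $r-1$                 $y'C_Ry$                         $\xi_R$
blocks       $V_F$       $n-1$                 $\lambda_{BF}\|h_{BF}\|^2$       $\lambda_{BF}\|T_F\mathbf{t}\|^2/(n-1) + \xi_B$
             $V_S$       $(n-1)(r-1)$          $\lambda_{BS}\|h_{BS}\|^2$       $\lambda_{BS}\|T_S\mathbf{t}\|^2/[(n-1)(r-1)] + \xi_B$
             total       $r(n-1)$              $y'C_By$
plots        $V_F$       $n-1$                 $\lambda_{EF}\|h_{EF}\|^2$       $\lambda_{EF}\|T_F\mathbf{t}\|^2/(n-1) + \xi_E$
             $V_S$       $(n-1)(r-1)$          $\lambda_{ES}\|h_{ES}\|^2$       $\lambda_{ES}\|T_S\mathbf{t}\|^2/[(n-1)(r-1)] + \xi_E$
             $V_E$       $(n-r)(n-1)-1$        $\lambda_{EE}\|h_{EE}\|^2$       $\lambda_{EE}\|T_E\mathbf{t}\|^2/[(n-r)(n-1)-1] + \xi_E$
             error       $n(rn-2r-n+1)+1$      $\mathrm{RSS}_E$                 $\xi_E$
             total       $rn(n-2)$             $y'C_Ey$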




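TABLE 17b
Analysis of variance when $r = n$

stratum      source of   df                    SS                               EMS
             variation
mean                     $1$                   $y'C_0y$                         $r\|T_0\mathbf{t}\|^2 + \xi_0$
replicates               $r-1$                 $y'C_Ry$                         $\xi_R$
blocks       $V_S$       $(n-1)^2$             $\lambda_{BS}\|h_{BS}\|^2$       $\lambda_{BS}\|T_S\mathbf{t}\|^2/(n-1)^2 + \xi_B$
             error       $n-1$                 $\mathrm{RSS}_B$                 $\xi_B$
             total       $n(n-1)$              $y'C_By$
plots        $V_S$       $(n-1)^2$             $\lambda_{ES}\|h_{ES}\|^2$       $\lambda_{ES}\|T_S\mathbf{t}\|^2/(n-1)^2 + \xi_E$
             $V_E$       $n-2$                 $\lambda_{EE}\|h_{EE}\|^2$       $\lambda_{EE}\|T_E\mathbf{t}\|^2/(n-2) + \xi_E$
             error       $(n-1)(n^2-2n-1)$     $\mathrm{RSS}_E$                 $\xi_E$
             total       $n^2(n-2)$            $y'C_Ey$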


Usually the stratum variances $\xi_\alpha$ are not known. If $d_\alpha \ne 0$ then $\mathrm{RSS}_\alpha/d_\alpha$ provides an unbiased estimate of $\xi_\alpha$, but in general such estimates are based on too few degrees of freedom, because one or more treatment terms have been fitted and removed in more than one stratum. For a rectangular lattice design with $r < n$ there is no such estimate of $\xi_B$, because $d_B = 0$.

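The solution to this difficulty is to estimate the stratum variances and the weights simultaneously. With the weighted fitted value $\hat{\mathbf{t}}$ given above, the sum of squares, $R_\alpha$, for the residual in stratum $W_\alpha$ is given by

\[
R_\alpha = \mathrm{RSS}_\alpha + \sum\nolimits'_\theta \lambda_{\alpha\theta} \sum_\beta \sum_\gamma w_{\beta\theta} w_{\gamma\theta} \langle h_{\alpha\theta} - h_{\beta\theta},\, h_{\alpha\theta} - h_{\gamma\theta}\rangle, \tag{6.3}
\]

with expected value $d'_\alpha \xi_\alpha$, where

\[ d'_\alpha = \dim(W_\alpha) - \sum\nolimits'_\theta w_{\alpha\theta} \dim(V_\theta). \tag{6.4} \]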


Equating observed and expected values of the $R_\alpha$ gives a set of equations in the $\xi_\alpha$. As Nelder (1968) observed, (6.3) simplifies considerably when there are only two effective strata. Thus for rectangular lattice designs we obtain the following equations for $\xi_B$ and $\xi_E$:

\[
\mathrm{RSS}_B + \sum_\theta \lambda_{B\theta} w_{E\theta}^2 \|h_{B\theta} - h_{E\theta}\|^2 = \xi_B \Bigl[ r(n-1) - \sum_\theta w_{B\theta} \dim(V_\theta) \Bigr],
\]

\[
\mathrm{RSS}_E + \sum_\theta \lambda_{E\theta} w_{B\theta}^2 \|h_{B\theta} - h_{E\theta}\|^2 = \xi_E \Bigl[ rn(n-2) - \sum_\theta w_{E\theta} \dim(V_\theta) \Bigr].
\]

Note that $\mathrm{RSS}_B$ is zero when $r < n$, and that the weights $w_{\alpha\theta}$ also involve the unknown $\xi_\alpha$. However, these equations may be solved, iteratively if necessary, to give us estimates $\hat\xi_B$ and $\hat\xi_E$ which, under normality, correspond to the so-called restricted maximum likelihood estimates, and these may be used to give the best available estimates of linear combinations $\langle x, \mathbf{t}\rangle$ and the estimated variances of those estimates.

It is clear that the analysis depends on the availability of the projection operators $C_\alpha$ and $T_\theta$. The former are quite standard, and correspond to fitting and removing the grand mean, replicate means, and block means. The latter are given by the fan and spoke totals, and so are straightforward to calculate, even by hand. If the statistical programming language GENSTAT is used, spoke totals are automatically calculated if $r$ treatment pseudo-factors are declared, one for each classification: the levels of the $a$th pseudo-factor are the $a$-spokes. An alternative strategy is to input $r$ copies of the data and use just two treatment pseudo-factors, FAN and SPOKE. In the $a$th copy, treatments in spoke $\mathcal{S}_{ai}$ are declared to have level $i$ of FAN and level $a$ of SPOKE. The treatment declaration FAN/SPOKE ensures that all the correct major calculations are done, using the sweeps of Wilkinson (1970), although minor adjustments have to be made to the output to allow for the multiple copies. Thompson (1983) explains this method, and its difficulties, in more detail, using the general methods of Thompson (1984), and shows that this type of pseudo-factorial structure is also useful for diallel experiments.

Thus, apart from the use of estimated weights because the stratum variances 
are in general not known, a completely satisfactory analysis of any rectangular 
lattice design can be made once the operators $T_\theta$ are available. Given these, the
analysis is analogous to that of a balanced incomplete block design with recovery 
of interblock information. 

Williams and Ratcliff (1980) gave a procedure for the analysis of rectangular 
lattice designs which differs from ours in two respects. In the first place, their 
covariance model is of the form 


\[ \operatorname{Cov}[(I - P_R)y] = \gamma_B P_B + \gamma_E I, \]


which differs from our equation (5.1). Secondly, our iterative analysis ensures
that the final estimates of $\xi_B$, $\xi_E$ and the treatment effects are consistent with
each other, while the Williams—Ratcliff procedure, which is based on that given 
by Yates (1940) and Cochran and Cox (1957, Section 1.3), is, roughly speaking, 
only the first cycle of the restricted maximum likelihood analysis of Patterson 
and Thompson (1971). The differences between these methods, which apply not 
only to rectangular lattice designs, will be discussed in more detail elsewhere. 


7. Rectangular lattices with cross-blocking. The foregoing ideas may be 
extended to a more complicated block structure. 

In Example 2 we have so far ignored the periods. However, it was desirable 
that each treatment should be fed once in each period. The experimenter 
concerned found that, for the rectangular lattice design constructed at the end of 
Section 3, the treatments could be permuted within sheep so that each treatment 
occurred once in each period: his proposed design is shown in Table 11. 




Unfortunately, this design takes no account of the grouping of the 36 experimental units into nine room-periods: each room-period consists of the four observations made in the same test period in the same room. In the notation of Nelder (1965a), the block structure is

3 periods × (3 rooms → 4 sheep).

The stratum projection matrices are given by

\[ C_0 = P_0, \]
\[ C_R = P_R - P_0, \]
\[ C_P = P_P - P_0, \]
\[ C_{RP} = P_{RP} - P_R - P_P + P_0, \]
\[ C_S = P_S - P_R, \]
\[ C_E = P_E - P_S - P_{RP} + P_R, \]

where, for example, $P_{RP}$ is the averaging matrix for room-periods. Although $V_0$, $V_F$, $V_S$ and $V_E$ are eigenspaces of $C_0$, $C_R$, $C_P$, and $C_S$, they are not eigenspaces of $C_{RP}$ and $C_E$ because the block design given by the room-periods alone is not in any sense balanced with respect to the treatment decomposition $V_0 \oplus V_F \oplus V_S \oplus V_E$. Thus the design is not generally balanced.

However, it is possible to permute the treatments given to each sheep so that each treatment occurs once in each period and the design is generally balanced. This may be done for $n(n-1)$ treatments in the simple orthogonal block structure

$(n-1)$ periods × [$(n-1)$ rooms → $n$ sheep]

as follows. Ignoring periods, the design is constructed from a set of mutually orthogonal Latin squares $A_1, \ldots, A_{n-3}$, as in Section 2. A supplementary $(n-1) \times (n-1)$ Latin square $\Delta$ is needed, whose letters are the remaining letters of $A_{n-2}$. Let $\delta_{ap}$ be the letter in row $a$ and column $p$ of $\Delta$. Then the treatment in the $p$th period and the $i$th animal of the $a$th room is the unique treatment which is in spoke $\mathcal{S}_{ai}$ and in letter $\delta_{ap}$ of $A_{n-2}$. In our particular example we may take the supplementary square $\Delta$ shown in Table 18: the resulting design is in Table 19.

In the notation of Section 3, $V_E$ is the main effect of $Q_1$, where the levels of $Q_1$ are the remaining letters of $A_{n-2}$. By our construction, $Q_1$ is completely confounded with room-periods, while all treatment vectors which are orthogonal to $Q_1$ are also orthogonal to room-periods. Hence the efficiency factors for this extension of the rectangular lattice design are those shown in Table 20.


TABLE 18 
Supplementary Latin square 
2 3 4 
3 4 2 


4 2 3 




TABLE 19
Generally balanced design for [periods × (rooms → sheep)]

room               1                   3
sheep         1   2   3   4       9  10  11  12
period 1      A   D   I   L       H   J   C   E
period 2      B   F   G   K       L   I   D   A
period 3      C   E   H   J       F   B   K   G
TABLE 20
Efficiency factors of an extended rectangular lattice design

                                treatment subspace
stratum                 $V_0$   $V_F = Q_2$                 $V_S = Q_1Q_2$                $V_E = Q_1$
mean $W_0$                1       0                           0                             0
rooms $W_R$               0       0                           0                             0
periods $W_P$             0       0                           0                             0
room-periods $W_{RP}$     0       0                           0                             1
sheep $W_S$               0       $\frac{1}{(n-1)^2}$         $\frac{n}{(n-1)^2}$           0
units $W_E$               0       $\frac{n(n-2)}{(n-1)^2}$    $\frac{n^2-3n+1}{(n-1)^2}$    0


Acknowledgments. The contribution of T.P.S. to the work reported here 
was carried out whilst he was a visitor at the Indian Statistical Institute in 
Calcutta, and he would like to thank the then Acting Director, Dr. A. Maitra, 
and other staff and scholars, for providing such an enjoyable environment for 
study. Part of R.A.B.’s contribution was made while she was visiting the 
Mathematics Department of the University of Western Australia, to whose staff 
she would like to extend similar thanks. We are grateful to A. Grassia 
of C.S.I.R.O., Perth, for drawing our attention to the problem described in 
Example 2. 


REFERENCES 


BAILEY, R. A. (1981). A unified approach to design of experiments. J. Roy. Statist. Soc. Ser. A 144
214-223.

BRUCK, R. H. (1963). Finite nets. II. Uniqueness and embedding. Pacific J. Math. 13 421-457.

COCHRAN, W. G. and COX, G. M. (1957). Experimental Designs, 2nd ed. Wiley, New York.

CORSTEN, L. C. A. (1976). Canonical correlation in incomplete blocks. In Essays in Probability and
Statistics. A Volume in Honour of Prof. Junjiro Ogawa (S. Ikeda et al., eds.) 125-154.
Shinko Tsusho Co. Ltd., Tokyo.

DÉNES, J. and KEEDWELL, A. D. (1974). Latin Squares and Their Applications. English Universities
Press Limited, London.




FISHER, R. A. and YATES, F. (1934). The 6 × 6 Latin squares. Proc. Cambridge Philos. Soc. 30
492-507.

GRUNDY, P. M. (1950). The estimation of error in rectangular lattices. Biometrics 6 25-33. 

HARSHBARGER, B. (1946). Preliminary report on the rectangular lattices. Biometrics 2 115-119. 

HARSHBARGER, B. (1947). Rectangular Lattices. Virginia Agricultural Experiment Station Memoir 1. 

HARSHBARGER, B. (1949). Triple rectangular lattices. Biometrics 5 1-13.

HARSHBARGER, B. (1951). Near balance rectangular lattices. Virginia J. Sci. 2 13-27.

HARSHBARGER, B. and Davis, L. L. (1952). Latinized rectangular lattices. Biometrics 8 73-84. 

HARVILLE, D. A. (1977). Maximum likelihood approaches to variance component estimation and to 
related problems (with discussion). J. Amer. Statist. Assoc. 72 320-340. 

HOUTMAN, A. M. and SPEED, T. P. (1983). Balance in designed experiments with orthogonal block
structure. Ann. Statist. 11 1069-1085.

JAMES, A. T. and WILKINSON, G. N. (1971). Factorization of the residual operator and canonical
decomposition of nonorthogonal factors in the analysis of variance. Biometrika 58
279-294.

KEMPTHORNE, O. (1952). The Design and Analysis of Experiments. Wiley, New York.

KEMPTHORNE, O. (1982). Classificatory data structures and associated linear models. In Statistics
and Probability: Essays in Honor of C. R. Rao (G. Kallianpur, P. R. Krishnaiah and
J. K. Ghosh, eds.) 397-410. North-Holland, Amsterdam.

MANN, H. B. (1942). The construction of orthogonal Latin squares. Ann. Math. Statist. 13 418-423.

NAIR, K. R. (1951). Rectangular lattices and partially balanced incomplete block designs. Biometrics
7 145-154.

NAIR, K. R. (1952). Analysis of partially balanced incomplete block designs illustrated on the simple
square and rectangular lattices. Biometrics 8 122-155.

NAIR, K. R. (1953). A note on rectangular lattices. Biometrics 9 101-106.

NELDER, J. A. (1965a). The analysis of randomized experiments with orthogonal block structure. I. 
Block structure and the null analysis of variance. Proc. Roy. Soc. London Ser. A 283 
147-162. 

NELDER, J. A. (1965b). The analysis of randomized experiments with orthogonal block structure. II. 
Treatment structure and the general analysis of variance. Proc. Roy. Soc. London Ser. A 
283 163-178. 

NELDER, J. A. (1968). The combination of information in generally balanced designs. J. Roy. Statist. 
Soc. Ser. B 30 303-311. 

PATTERSON, H. D. and THOMPSON, R. (1971). Recovery of inter-block information when block sizes
are unequal. Biometrika 58 545-554.

PATTERSON, H. D. and WILLIAMS, E. R. (1976). A new class of resolvable incomplete block designs.
Biometrika 63 83-92.

PEARCE, S. C., CALINSKI, T. and MARSHALL, T. F. DE C. (1974). The basic contrasts of an 
experimental design with special reference to the analysis of data. Biometrika 61 449-460. 

RAGHAVARAO, D. (1971). Constructions and Combinatorial Problems in Design of Experiments. 
Wiley, New York. 

Rosinson, H. F. and Watson, G. S. (1949). Analysis of simple and triple rectangular lattice designs. 
North Carolina Agricultural Experimental Station Tech. Bull. 88. 

SHRIKHANDE, S. S. (1961). A note on mutually orthogonal Latin squares. Sankhyā Ser. A 23 115-116.

SPEED, T. P. and BAILEY, R. A. (1982). On a class of association schemes derived from lattices of
equivalence relations. In Algebraic Structures and Applications (P. Schultz, C. E. Praeger
and R. P. Sullivan, eds.) 55-74. Marcel Dekker, New York.

THOMPSON, R. (1983). Diallel crosses, partially balanced incomplete block designs with triangular 
association schemes and rectangular lattices. Genstat Newsletter 10 16-32. 

THOMPSON, R. (1984). The use of multiple copies of data in forming and interpreting analysis of 
variance. In Experimental Design, Statistical Models, and Genetic Statistics. Essays in 
Honor of Oscar Kempthorne (K. Hinkelmann, ed.) 155-173. Marcel Dekker, New York. 

THROCKMORTON, T. N. (1961). Structures of classificatory data. Ph.D. thesis, Iowa State Univ. 

WILKINSON, G. N. (1970). A general recursive algorithm for analysis of variance. Biometrika 57 
19-46. 




WILLIAMS, E. R. (1977). A note on rectangular lattice designs. Biometrics 33 410-414. 

WILLIAMS, E. R. and RATCLIFF, D. (1980). A note on the analysis of lattice designs with repeats.
Biometrika 67 706-708.

YATES, F. (1936). A new method of arranging variety trials involving a large number of varieties. J.
Agric. Sci. 26 424-455.

YATES, F. (1940). The recovery of inter-block information in balanced incomplete block designs. Ann. 
Eugenics 10 317-325. 


STATISTICS DEPARTMENT
ROTHAMSTED EXPERIMENTAL STATION
HARPENDEN
HERTFORDSHIRE AL5 2JQ
ENGLAND

C.S.I.R.O.
DIVISION OF MATHEMATICS AND STATISTICS
G.P.O. BOX 1965
CANBERRA ACT 2601
AUSTRALIA


The Annals of Statistics
1986, Vol. 14, No. 3, 896-906


CONSERVATIVE CONFIDENCE BANDS IN 
CURVILINEAR REGRESSION¹


By DANIEL Q. NAIMAN 


The Johns Hopkins University 


This paper gives a method for constructing conservative Scheffé-type
simultaneous confidence bands for curvilinear regression functions over finite
intervals. The method is based on the use of a geometric inequality giving an
upper bound for the uniform measure of the set of points within a given
distance from $\gamma$, an arbitrary piecewise differentiable path with finite length
in $S^{k-1}$, the unit sphere in $\mathbb{R}^k$. The upper bound is obtained by "straightening"
the path so that it lies in a great circle in $S^{k-1}$.


1. Introduction. This paper gives a method for constructing conservative Scheffé-type simultaneous confidence bands for curvilinear regression functions over intervals. The method is based on the use of a geometric inequality (Theorem 3.1) giving an upper bound for the uniform measure of the set of points within a given distance from $\gamma$, an arbitrary piecewise differentiable path with finite length in $S^{k-1}$, the unit sphere in $\mathbb{R}^k$.

Consider the curvilinear regression model in which we observe 


\[ Y_i = \sum_{j=1}^{k} b_j f_j(x_i) + e_i, \qquad i = 1, \ldots, n, \tag{1.1} \]

where the regression coefficients $b_j$ are unknown, the $f_j$ are known functions, the design points $x_i$ are known, and the random variables $e_i$ are i.i.d. normal with mean 0 and variance $\sigma^2$, with $\sigma^2$ unknown. For example, if $f_j(x) = x^{j-1}$ for $j = 1, \ldots, k$, (1.1) is the usual polynomial regression model of degree $k - 1$. More generally, all of the results of this paper have obvious analogues when the random vector $e = (e_1, \ldots, e_n)'$ has a spherically symmetric distribution about 0.

We use $f(x)$ to denote the vector $(f_1(x), \ldots, f_k(x))'$ for $x \in \mathbb{R}$. Let $I \subseteq \mathbb{R}$ be a
closed (not necessarily finite) interval fixed for the remainder of this paper. $f$
defines a function from $I$ to $\mathbb{R}^k$ which we assume to be continuous, bounded
away from the origin, and piecewise differentiable with /,|If(x)||? dx finite. b is 
used to denote the least-squares estimator of $b = (b_1, \ldots, b_k)'$ and $s^2$ denotes the usual unbiased estimator of $\sigma^2$. We assume the design matrix is of full rank so that $\hat{b} \sim N_k(b, \sigma^2\Sigma)$ for some known positive definite matrix $\Sigma$, $\nu s^2/\sigma^2 \sim \chi^2_\nu$, where $\nu = n - k$, and $\hat{b}$ and $s^2$ are independent. Let $P$ be a $k \times k$ matrix such that $P'P = \Sigma$.


Received October 1984; revised September 1985. 

¹ Research supported in part by National Science Foundation Grant No. DMS-8403646, and by
Office of Naval Research Contract No. N00014-79-C-0801.

AMS 1980 subject classifications. Primary 60E15, 62J02; secondary 62J05, 62F25, 60D05.

Key words and phrases. Curvilinear regression, confidence band. 




CONSERVATIVE CONFIDENCE BANDS 897 


Suppose we wish to construct a simultaneous confidence band for the regression
function $b'f(x)$ for $x \in I$. We consider Scheffé-type bands, that is, bands of
the form

(1.2) $\hat b'f(x) \pm c\,s\,p(x)$ for $x \in I$,

where $p(x) = \{f(x)'\Sigma f(x)\}^{1/2} = \|Pf(x)\|$, and $c > 0$. The coverage probability of
the band is defined to be the probability that all of the intervals (1.2) cover $b'f(x)$
simultaneously, as $x$ ranges throughout $I$. The main result of this paper
(Theorem 4.1) gives a lower bound for this probability, which allows for the
construction of a conservative band.

The problem of obtaining simultaneous confidence bands in regression has 
received a considerable amount of attention recently. Results for multiple regres- 
sion functions under restrictions on the predictor variables have been obtained by 
Casella and Strawderman (1980), Uusipaikka (1984) and Naiman (1984). Wynn 
and Bloomfield (1971) considered quadratic regression over the real line. Knafl, 
Sacks and Ylvisaker (1985) presented a numerical method for estimating the 
coverage probability of Scheffé-type bands in polynomial regression, based on the 
use of an inequality for the distribution of the maximum of a Gaussian process. 
Wynn (1984) obtained results in the polynomial regression context for a different 
class of bands, using results from the theory of quadrature. 


2. Expression for the coverage probability. Before giving an expression
for the coverage probability of the band (1.2) we introduce some notation and
definitions. $S^{k-1}$ denotes the unit sphere centered at the origin in $R^k$. Throughout
this paper $U$ denotes a random vector with a uniform distribution on $S^{k-1}$. $\mu$ is
used to denote the uniform probability measure on $S^{k-1}$. $F_{i,j}$ denotes the $F$
distribution function with $i$ numerator degrees of freedom and $j$ denominator
degrees of freedom.


DEFINITION 2.1. A path in $S^{k-1}$ is a piecewise differentiable function $\gamma$
mapping $I$ into $S^{k-1}$ such that $\Lambda(\gamma) = \int_I \|\gamma'(x)\|\,dx$, which we refer to as the
length of $\gamma$, is finite. The image of the path is defined by $\Gamma(\gamma) = \{\gamma(x): x \in I\}$.

Note that it is possible for a path to overlap itself so that the length of the
path is not necessarily the same as the length of the curve that $\Gamma(\gamma)$ defines.

For any closed subset $T$ of $S^{k-1}$ and $u \in S^{k-1}$ define

$c_T(u) = \sup\{u'v: v \in T\}.$

For $r \in [0,1]$ define

$T_{(r)} = \{u \in S^{k-1}: c_T(u) \ge r\}$
$\quad\;\; = \{u \in S^{k-1}: \|v - u\|^2 \le 2(1 - r) \text{ for some } v \in T\},$

so that $T_{(r)}$ is the set of points in $S^{k-1}$ which are within $\{2(1 - r)\}^{1/2}$ of $T$. Note
that $T_{(r)}$ is empty for $r > 1$.
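
The two descriptions of the neighborhood of $T$ agree because $\|v - u\|^2 = 2(1 - u'v)$ for unit vectors $u$ and $v$. The following is a minimal Python sketch of that equivalence for a finite set $T$; the function names are hypothetical and are not part of the paper.

```python
import math
import random


def random_unit_vector(k, rng):
    """Draw a point uniformly on the unit sphere in R^k by normalizing Gaussians."""
    x = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]


def in_neighborhood(u, T, r):
    """True if c_T(u) = max over v in T of u'v is at least r (T is a finite list)."""
    return max(sum(a * b for a, b in zip(u, v)) for v in T) >= r


def within_distance(u, T, r):
    """True if some v in T satisfies ||v - u||^2 <= 2(1 - r)."""
    bound = 2.0 * (1.0 - r)
    return any(sum((a - b) ** 2 for a, b in zip(u, v)) <= bound for v in T)
```

For unit vectors the two predicates are the same condition written in inner-product form and in distance form.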
We define 


(2.1) $\gamma(x) = \|Pf(x)\|^{-1} Pf(x)$ for $x \in I$,


898 D. Q. NAIMAN 


where $P$ and $f$ are defined in Section 1. Using the assumptions given about $f$ in
Section 1 it is easy to verify that $\gamma$ and $-\gamma$ are paths in $S^{k-1}$.
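
As an illustration of (2.1) and Definition 2.1, the sketch below approximates the length of the path for quadratic regression, $f(x) = (1, x, x^2)'$. It assumes $P$ is the identity matrix purely for illustration (in practice $P$ depends on the design), and the function names are hypothetical. The length is approximated from below by summing great-circle distances between consecutive points.

```python
import math


def gamma(x):
    """Point of the path: f(x) = (1, x, x^2) scaled to unit length (P assumed to be I)."""
    f = (1.0, x, x * x)
    norm = math.sqrt(sum(c * c for c in f))
    return tuple(c / norm for c in f)


def path_length(path, lo, hi, pieces=2000):
    """Approximate the path length on [lo, hi] by a sum of great-circle distances."""
    total = 0.0
    prev = path(lo)
    for i in range(1, pieces + 1):
        cur = path(lo + (hi - lo) * i / pieces)
        chord = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, cur)))
        # Angle between unit vectors separated by a chord of this length.
        total += 2.0 * math.asin(min(1.0, chord / 2.0))
        prev = cur
    return total


if __name__ == "__main__":
    print(path_length(gamma, -1.0, 1.0))
```

Only the length $\Lambda(\gamma)$ enters the bounds of Sections 3 and 4, so a one-dimensional numerical approximation of this kind is all that is needed from the path.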

Lemma 2.1, which is due to Uusipaikka (1984), expresses the probability that
coverage fails for the band (1.2) as the probability that a random vector $U$
distributed uniformly on the unit sphere lies within the (random) distance
$\{2(1 - cT)\}^{1/2}$ of $\Gamma(\gamma) \cup -\Gamma(\gamma)$, where $T = s/\|B\|$.


LEMMA 2.1. The coverage probability of the band (1.2) is given by

$1 - \int_0^{1/c} \mu[\{\Gamma(\gamma) \cup -\Gamma(\gamma)\}_{(ct)}]\, f_T(t)\,dt,$

where $f_T$ denotes the density function of $T = s/\|B\|$, so that $kT^2 \sim F_{\nu,k}$.


PROOF. Let $B = (P')^{-1}(\hat b - b)$ and $U = \|B\|^{-1}B$. $B$ has a $k$-variate normal
distribution with zero mean vector and covariance matrix $\sigma^2 I_k$, independent of $s$,
and $U$ has a uniform distribution on $S^{k-1}$ independent of $(\|B\|, s)$. If $T = s/\|B\|$
it follows that $kT^2 \sim F_{\nu,k}$, and $T$ is independent of $U$.

The probability that simultaneous coverage for the intervals (1.2) fails is given
by

$P[\sup_{x \in I}\{p(x)^{-1} f(x)'(\hat b - b)\} > cs \text{ or } \sup_{x \in I}\{p(x)^{-1} f(x)'(b - \hat b)\} > cs]$
$\quad = P[\sup_{x \in I}\{\gamma(x)'U\} > cT \text{ or } \sup_{x \in I}\{-\gamma(x)'U\} > cT]$
$\quad = P[U \in \{\Gamma(\gamma) \cup -\Gamma(\gamma)\}_{(cT)}].$

Conditioning on $T$ and using the independence of $U$ and $T$ leads to the desired
result. □


3. Upper bound for $\mu\{\Gamma(\gamma)_{(r)}\}$. For a given path $\gamma$ in $S^{k-1}$ the main result
of this section, Theorem 3.1, gives an upper bound for $\mu\{\Gamma(\gamma)_{(r)}\}$, the uniform
measure of the set of points within a given distance of the image of $\gamma$. The upper
bound may be interpreted as follows. If we replace $\gamma$ by $\gamma^*$, a path of the same
length but whose image lies on a great circle, then the bound may be thought of
as $\mu\{\Gamma(\gamma^*)_{(r)}\}$, except that we calculate this by ignoring overlap and instead of
counting points in $\Gamma(\gamma^*)_{(r)}$ which are closest to multiple points in $\Gamma(\gamma)$ once, we
count them according to their multiplicities. Thus, we obtain a bound which
depends only on the length of the path and consists of two terms. The first term
is proportional to the length of the path and corresponds to the “tube” of points
in $\Gamma(\gamma^*)_{(r)}$ which are closest to points in the interior of $\Gamma(\gamma^*)$. The second term is
the sum of the measures of two half spherical caps of angular radius $\cos^{-1} r$
corresponding to the points in $\Gamma(\gamma^*)_{(r)}$ which are closest to one of the endpoints
of $\gamma^*$.

The proof of Theorem 3.1 may be sketched as follows. It suffices to consider
the case when $\Gamma(\gamma)$ is piecewise composed of great circular arcs, since $\Gamma(\gamma)$ can be
approximated by curves of this type. In Lemmas 3.3 and 3.4 we prove the
inequality described above for a path whose image is composed of great circular
arcs by induction on the number of arcs. The piecewise great circular curve is
replaced by a curve of equal length on a single great circle by “straightening out”
the curve at each point where the circular arcs are joined. For a path whose
image is on a great circle the geometry of the problem is simple and exact
formulas are obtained in Lemmas 3.1 and 3.2.

We first state some definitions and give some lemmas used in the proof of the 
theorem. Let $r \in (0,1)$ be fixed throughout this section.


DEFINITION 3.1. A great circular arc in $S^{k-1}$ with endpoints $a$ and $b$ is a set
of points of the form

$T\{x: x \in S^{k-1},\ x_1^2 + x_2^2 = 1,\ x_1 \ge t,\ x_2 \ge 0\},$

for some $t \in (0,1)$, and some orthogonal transformation $T$, where
$a = T((1, 0, 0, \ldots, 0)')$ and $b = T((t, \{1 - t^2\}^{1/2}, 0, \ldots, 0)')$. The length of the arc is
$\cos^{-1} t$.

For a given great circular arc $A$ in $S^{k-1}$ with endpoints $a$ and $b$ we define the
following sets:

$C(A) = \{u \in A_{(r)}: c_A(u) > \max\{u'a, u'b\}\},$
$D(A,a,b) = \{u \in A_{(r)}: c_A(u) = u'a\},$
$E(A,a,b) = \{u \in A_{(r)}: c_A(u) = u'b\}.$


REMARK 3.1. If $A$ is any closed subset of $S^{k-1}$ it is easy to verify that
$(TA)_{(r)} = T(A_{(r)})$, so $\mu\{(TA)_{(r)}\} = \mu(A_{(r)})$, for any orthogonal transformation $T$.


REMARK 3.2. It follows easily from the above definitions that $C(TA) = T\{C(A)\}$,
$D(TA, Ta, Tb) = T\{D(A,a,b)\}$, and $E(TA, Ta, Tb) = T\{E(A,a,b)\}$,
for any great circular arc $A$ and for any orthogonal transformation $T$.


LEMMA 3.1. Let $A$ be the great circular arc given in Definition 3.1 with $T$
being the identity map. Then

(i) $C(A) = \{u \in S^{k-1}: u_1^2 + u_2^2 \ge r^2,\ u_1/\{u_1^2 + u_2^2\}^{1/2} > t,\ u_2 > 0\}$,
(ii) $D(A,a,b) = \{u \in S^{k-1}: u_1 \ge r,\ u_2 \le 0\}$,
(iii) $E(A,a,b) = \{u \in S^{k-1}: tu_1 + \{1 - t^2\}^{1/2}u_2 \ge r,\ u_1/\{u_1^2 + u_2^2\}^{1/2} \le t\}$.


PROOF. Fix $u \in S^{k-1}$ and let $c = c_A(u)$. Define $h(s) = (s, \{1 - s^2\}^{1/2}, 0, \ldots, 0)u$
for $s \in [t, 1]$, so that $c = \sup\{h(s): s \in [t, 1]\}$.

To prove (i), let $C'$ denote the set on the right-hand side of the equality in
(i). If $u \in C'$ let $v = (u_1/\{u_1^2 + u_2^2\}^{1/2}, u_2/\{u_1^2 + u_2^2\}^{1/2}, 0, \ldots, 0)'$. Clearly
$v \in A$, hence $c \ge u'v = \{u_1^2 + u_2^2\}^{1/2} \ge r$, so $u \in A_{(r)}$. Using the fact that
$t < u_1/\{u_1^2 + u_2^2\}^{1/2}$ and $u_2 > 0$ it is easy to show $h(1^-) > h(1)$, hence
$c > h(1) = u'a$, and $h(t^+) > h(t)$, hence $c > h(t) = u'b$. This proves $u \in C(A)$.

Now suppose $u \in C(A)$ so that $c \ge r$ and $h$ is maximized at some $s \in (t, 1)$,
where $h'(s) = u_1 - su_2/\{1 - s^2\}^{1/2} = 0$, and $h(s) > \max\{h(t), h(1)\}$. $u_2$ must
be nonzero since $u_2 = 0$ and $h'(s) = 0$ together imply $u_1 = 0$ hence $h(s) = h(1)$.
It follows that $s/\{1 - s^2\}^{1/2} = u_1/u_2$ and $h(s) = \{u_1^2 + u_2^2\}^{1/2} \ge r$. Also
$h''(s) = -u_2/\{1 - s^2\}^{3/2} < 0$ so $u_2 > 0$. We have

$0 < t/\{1 - t^2\}^{1/2} < s/\{1 - s^2\}^{1/2} = u_1/u_2,$

so $u_1 > 0$, and if we apply the monotone increasing function $g(x) = x/\{1 + x^2\}^{1/2}$
we obtain $t < u_1/\{u_1^2 + u_2^2\}^{1/2}$ so $u \in C'$ and the proof of (i) is complete.

To prove (ii), let $D'$ denote the set on the right-hand side of the equality in (ii).
If $u \in D$ then $u_1 = u'a = c \ge r$ and $h(s)$ is maximized at $s = 1$. Since $h(1^-) \le h(1)$,
it follows that $u_2 \le 0$, so $u \in D'$.

If $u \in D'$ then $h'(s) = u_1 - su_2/\{1 - s^2\}^{1/2} > 0$ for $s \in [t, 1)$. It follows that
$h$ is maximized at $s = 1$, so $c = u'a$. Furthermore, $u'a = u_1 \ge r$, so $u \in D$.

For (iii), let $T$ be the orthogonal transformation on $S^{k-1}$ defined by

$T(u) = (tu_1 + \{1 - t^2\}^{1/2}u_2,\ \{1 - t^2\}^{1/2}u_1 - tu_2,\ u_3, \ldots, u_k)'.$

Then $T(A) = A$, $T(a) = b$, and $T(b) = a$. It follows that $T(D(A,a,b)) = E(A,a,b)$,
and using (ii) it is easy to verify that $T(D(A,a,b))$ is the set given on
the right-hand side of (iii). □


REMARK 3.3. From Lemma 3.1 it follows easily that $A_{(r)}$ is the disjoint union
of the sets $C(A)$, $D(A,a,b)$, and $E(A,a,b)$.


LEMMA 3.2. If $A$ is a great circular arc in $S^{k-1}$ with endpoints $a$ and $b$ and
length $L$, then

$\mu\{C(A)\} = F_{k-2,2}(2(r^{-2} - 1)/(k - 2)) \times L/(2\pi)$

and

$\mu\{D(A,a,b)\} = \mu\{E(A,a,b)\} = F_{k-1,1}((r^{-2} - 1)/(k - 1))/4.$


PROOF. By Remarks 3.1 and 3.2 it suffices to consider the case when $A$ is as
given in Definition 3.1 with $T$ being the identity map.

Write $U = \|X\|^{-1}X$ where $X \sim N_k(0, I_k)$. Then
$\|(U_1, U_2)'\|^{-1}(U_1, U_2)' = \|(X_1, X_2)'\|^{-1}(X_1, X_2)'$ is independent of
$(X_1^2 + X_2^2, X_3, \ldots, X_k)'$ so if we let
$F = \{\sum_{i=3}^{k} X_i^2/(k - 2)\}/\{(X_1^2 + X_2^2)/2\}$, then $F \sim F_{k-2,2}$ independent of
$\|(X_1, X_2)'\|^{-1}(X_1, X_2)'$.

Using Lemma 3.1(i) we obtain

$P(U \in C(A)) = P(\{X_1^2 + X_2^2\}/\|X\|^2 \ge r^2,\ X_1/\{X_1^2 + X_2^2\}^{1/2} > t,\ X_2/\{X_1^2 + X_2^2\}^{1/2} > 0)$
$\quad = P(1/\{1 + (k - 2)F/2\} \ge r^2)\, P(X_1/\{X_1^2 + X_2^2\}^{1/2} > t,\ X_2/\{X_1^2 + X_2^2\}^{1/2} > 0)$
$\quad = P(F \le 2(r^{-2} - 1)/(k - 2)) \times L/(2\pi),$

and the proof of the first equality is complete.

To prove the second equalities, note that by symmetry

$P(U \in D(A,a,b)) = P(U \in E(A,a,b)).$

Let $X$ be as above and note that $F' = \sum_{i=2}^{k} X_i^2/((k - 1)X_1^2) \sim F_{k-1,1}$. Using
Lemma 3.1 and the independence of $X_2/|X_2|$ and $(X_1, X_2^2, X_3, \ldots, X_k)'$ we obtain

$P(U \in D(A,a,b)) = P(X_1/\|X\| \ge r,\ X_2/|X_2| \le 0)$
$\quad = P(X_1/\|X\| \ge r)\,P(X_2/|X_2| \le 0) = P(X_1/\|X\| \ge r)/2$
$\quad = P(X_1^2/\|X\|^2 \ge r^2)/4 = P(1/\{1 + (k - 1)F'\} \ge r^2)/4$
$\quad = P(F' \le (r^{-2} - 1)/(k - 1))/4,$

and the proof is complete. □
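
For $k = 3$ the two $F$ distribution functions in Lemma 3.2 reduce to elementary expressions, $F_{1,2}(2(r^{-2} - 1)) = (1 - r^2)^{1/2}$ and $F_{2,1}((r^{-2} - 1)/2) = 1 - r$, because a coordinate of a uniform point on $S^2$ is uniform on $[-1, 1]$. The sketch below (hypothetical function names, Monte Carlo sample size chosen arbitrarily) compares the resulting closed forms with simulated frequencies of the sets described in Lemma 3.1(i) and (ii).

```python
import math
import random


def sample_sphere(rng):
    """Uniform point on the unit sphere in R^3 via a normalized Gaussian vector."""
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]
    n = math.sqrt(sum(c * c for c in x))
    return [c / n for c in x]


def estimate_measures(r, t, n_draws=200_000, seed=0):
    """Monte Carlo frequencies of C(A) and D(A, a, b) as described in Lemma 3.1."""
    rng = random.Random(seed)
    in_c = in_d = 0
    for _ in range(n_draws):
        u1, u2, _u3 = sample_sphere(rng)
        rho = math.hypot(u1, u2)
        if rho >= r and u2 > 0 and u1 / rho > t:
            in_c += 1  # Lemma 3.1(i)
        if u1 >= r and u2 <= 0:
            in_d += 1  # Lemma 3.1(ii)
    return in_c / n_draws, in_d / n_draws


def closed_forms(r, t):
    """Lemma 3.2 specialized to k = 3, for the arc of length arccos(t)."""
    arc_length = math.acos(t)
    mu_c = math.sqrt(1.0 - r * r) * arc_length / (2.0 * math.pi)
    mu_d = (1.0 - r) / 4.0
    return mu_c, mu_d
```

With $r = 0.8$ and $t = 0.5$ the closed forms are $0.1$ and $0.05$, and the simulated frequencies should agree up to Monte Carlo error.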


LEMMA 3.3. Let $A_1$ and $A_2$ be great circular arcs in $S^{k-1}$ with common
endpoint $(1, 0, \ldots, 0)'$ so that

$A_i = \{(\{1 - s^2\}^{1/2}, sv_i')': 0 \le s \le s_i\}$

for some $v_i \in S^{k-2}$, and $s_i \in (0,1)$ for $i = 1, 2$. Thus $A_1$ has endpoints
$a_1 = (\{1 - s_1^2\}^{1/2}, s_1v_1')'$ and $b_1 = (1, 0, \ldots, 0)'$, and $A_2$ has endpoints
$a_2 = (1, 0, \ldots, 0)'$ and $b_2 = (\{1 - s_2^2\}^{1/2}, s_2v_2')'$. Set $B_i = (A_i)_{(r)}$, $C_i = C(A_i)$,
$D_i = D(A_i, a_i, b_i)$, and $E_i = E(A_i, a_i, b_i)$ for $i = 1, 2$.

Define

$F_1 = \{(t, x')' \in S^{k-1}: x'v_1 > 0,\ x'v_2 > 0,\ r \le t < 1\},$
$F_2 = \{(t, x')' \in S^{k-1}: x'v_1 \le 0,\ x'v_2 \le 0,\ r \le t < 1\}.$

Then we have the following:

(i) $(B_1 \cup B_2) - (C_1 \cup D_1 \cup C_2 \cup E_2) \subset F_2$;

(ii) $F_1 \subset (C_1 \cup D_1) \cap (C_2 \cup E_2)$;

(iii) $\mu(F_2) = \mu(F_1)$.

REMARK 3.4. Since Lemma 3.3 is the basic tool used to prove Lemma 3.4
some discussion is appropriate. Consider the set $A = A_1 \cup A_2$. $A_{(r)}$ is composed
of four pieces, namely, $C_1 \cup D_1$, $C_2 \cup E_2$, $S$, the set on the left-hand side of
(i), and $T$, the set on the right-hand side of (ii). Using (i)-(iii) we see that
$\mu(C_1 \cup D_1) + \mu(C_2 \cup E_2)$ is an upper bound for $\mu(A_{(r)})$. We can interpret this
upper bound as $\mu(A^*_{(r)})$, where $A^*$ is $A$ “straightened out,” that is, $A^*$ has the
same length as $A$ but its components all lie on the same great circle.


PROOF. We use $c_i$ to denote $c_A$ for $A = A_i$, $i = 1, 2$. To prove (i), define

$G_i = \{(t, x')' \in S^{k-1}: x'v_i \le 0,\ r \le t < 1\}, \quad i = 1, 2.$

We will show below that

(a) $E_1 - (C_2 \cup E_2) \subset D_2$ and $D_2 - (C_1 \cup D_1) \subset E_1$,
(b) $E_1 \subset G_1$ and $D_2 \subset G_2$.

To see that (i) follows, we have

$(B_1 \cup B_2) - (C_1 \cup D_1 \cup C_2 \cup E_2) \subset [B_1 - (C_1 \cup D_1 \cup C_2 \cup E_2)] \cup [B_2 - (C_1 \cup D_1 \cup C_2 \cup E_2)]$
$\quad \subset [E_1 - (C_2 \cup E_2)] \cup [D_2 - (C_1 \cup D_1)]$
$\quad \subset E_1 \cap D_2 \subset G_1 \cap G_2 = F_2,$

where the last line uses (a) and (b).

To prove (a), suppose $u \in E_1 - (C_2 \cup E_2)$. Then $u \in E_1$ so $c_1(u) = u'b_1 \ge r$.
But $b_1 \in A_2$, so $c_2(u) \ge u'b_1 \ge r$, thus $u \in B_2$ and it follows that $u \in D_2$. The
second claim in (a) is proved in exactly the same manner.

To prove (b), note that if $u = (t, x')' \in E_1$ then $c_1(u) \ge r$ and $c_1(u) = u'b_1 = t$
so $t \ge r$. Let $h(s) = (t, x')(\{1 - s^2\}^{1/2}, sv_1')'$ for $s \in [0, s_1]$. Since the supremum
of $h(s)$ is attained at $s = 0$, $0 \ge h'(0) = x'v_1$, thus $u \in G_1$. The proof of the
second claim is the same.

To prove (ii), let $u = (t, x')' \in F_1$ so that $r \le t < 1$ and $x'v_i > 0$ for $i = 1, 2$.
Then $c_1(u) \ge u'b_1 = t \ge r$, so $u \in B_1$. If $h$ is the function defined above, then
$h'(0) = x'v_1 > 0$, so $h(s)$ is not maximized at $s = 0$, and hence $c_1(u) > u'b_1$.
Thus $u \in B_1 - E_1 = C_1 \cup D_1$. This proves the first claim. The proof of the
second claim is the same.

For (iii), if $T$ is the orthogonal transformation defined by $T((t, x')') = (t, -x')'$,
then $\mu\{F_2\} = \mu\{TF_1\} = \mu\{F_1\}$. □


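LEMMA 3.4. Let $A_i$ be a great circular arc with endpoints $a_i$ and $b_i$ for
$i = 1, \ldots, m$ and assume $a_i = b_{i-1}$ for $i = 2, \ldots, m$. Set $B_i = (A_i)_{(r)}$, $C_i = C(A_i)$,
$D_i = D(A_i, a_i, b_i)$, and $E_i = E(A_i, a_i, b_i)$. Then

$\mu\left(\bigcup_{i=1}^{m} B_i\right) \le \sum_{i=1}^{m} \mu(C_i) + \mu(D_1) + \mu(E_m).$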


PROOF. The proof is by induction on $m$. For $m = 1$ the result follows from
the fact that $B_1 = C_1 \cup D_1 \cup E_1$. Assume the result holds for $m = M \ge 1$ and
consider the case when $m = M + 1$. We can assume without loss of generality
that $b_1 = a_2 = (1, 0, \ldots, 0)'$, since if necessary, we can apply an orthogonal
transformation and use Remarks 3.1 and 3.2. Thus, we may assume that $A_1$ and
$A_2$ are given as in the statement of Lemma 3.3. Let $F_1$, $F_2$ be as in the statement
of Lemma 3.3 and define

$H_1 = C_1 \cup D_1$

and

$H_2 = \bigcup_{i=2}^{M+1} B_i - D_2.$

We will show the following:

(i) $\bigcup_{i=1}^{M+1} B_i - (H_1 \cup H_2) \subset F_2$,
(ii) $F_1 \subset H_1 \cap H_2$.

To prove (i), note that for $i > 2$,

$B_i - H_2 \subset D_2 = B_2 - (C_2 \cup E_2).$

Thus

$(B_i - H_2) - H_1 \subset (B_2 - (C_2 \cup E_2)) - (C_1 \cup D_1),$

so we obtain

(3.1) $B_i - (H_1 \cup H_2) \subset (B_1 \cup B_2) - (D_1 \cup C_1 \cup C_2 \cup E_2).$

Since $D_1 \cup C_1 \cup C_2 \cup E_2 \subset H_1 \cup H_2$ it follows that (3.1) holds for $i = 1$ and 2.
Thus,

$\bigcup_{i=1}^{M+1} B_i - (H_1 \cup H_2) \subset (B_1 \cup B_2) - (D_1 \cup C_1 \cup C_2 \cup E_2)$

and the result follows from Lemma 3.3(i).

To prove (ii), note that

$C_2 \cup E_2 = B_2 - D_2 \subset H_2,$

so

$(C_1 \cup D_1) \cap (C_2 \cup E_2) \subset H_1 \cap H_2,$

and the result follows from Lemma 3.3(ii).

Using (ii) and Lemma 3.3(iii)

(3.2) $\mu(H_1 \cup H_2) \le \mu(H_1) + \mu(H_2) - \mu(F_1) = \mu(H_1) + \mu(H_2) - \mu(F_2).$

By the induction hypothesis

(3.3) $\mu(H_2) = \mu\left(\bigcup_{i=2}^{M+1} B_i\right) - \mu(D_2)$
$\quad \le \sum_{i=2}^{M+1} \mu(C_i) + \mu(D_2) + \mu(E_{M+1}) - \mu(D_2)$
$\quad = \sum_{i=2}^{M+1} \mu(C_i) + \mu(E_{M+1}).$

It follows from (i) and inequalities (3.2) and (3.3) that

$\mu\left(\bigcup_{i=1}^{M+1} B_i\right) \le \mu\left(\bigcup_{i=1}^{M+1} B_i - (H_1 \cup H_2)\right) + \mu(H_1 \cup H_2)$
$\quad \le \mu(F_2) + \mu(H_1 \cup H_2) \le \mu(H_1) + \mu(H_2)$
$\quad \le \mu(C_1) + \mu(D_1) + \sum_{i=2}^{M+1} \mu(C_i) + \mu(E_{M+1}),$

and the result holds for $m = M + 1$, so the induction is complete. □

THEOREM 3.1. If $\gamma$ is a path in $S^{k-1}$ then

$\mu\{\Gamma(\gamma)_{(r)}\} \le \min\{F_{k-2,2}[2(r^{-2} - 1)/(k - 2)] \times \Lambda(\gamma)/(2\pi) + F_{k-1,1}[(r^{-2} - 1)/(k - 1)]/2,\ 1\}.$


PROOF. If $\gamma$ is a path such that $\Gamma(\gamma) = \bigcup_{i=1}^{m} A_i$ where the sets $A_i$ are as
given in the statement of Lemma 3.4, then the inequality for $\mu\{\Gamma(\gamma)_{(r)}\}$ follows
from Lemmas 3.2 and 3.4, and the fact that $\mu$ is a probability measure. The
general case follows from the fact that we can approximate $\Gamma(\gamma)$ by sets of this
form. □
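
A numerical illustration of Theorem 3.1 for $k = 3$ may help. The sketch below is not from the paper: it uses the elementary forms $F_{1,2}(2(r^{-2} - 1)) = (1 - r^2)^{1/2}$ and $F_{2,1}((r^{-2} - 1)/2) = 1 - r$, represents a path made of two great circular arcs joined at $(1, 0, 0)'$ by a finite set of points (an approximation), and estimates the measure of the tube by simulation. All function names and the sample sizes are assumptions.

```python
import math
import random


def tube_bound_k3(r, length):
    """Theorem 3.1 bound for k = 3, written with elementary functions."""
    return min(math.sqrt(1.0 - r * r) * length / (2.0 * math.pi) + (1.0 - r) / 2.0, 1.0)


def bent_path(angle1, angle2, bend, pieces=50):
    """Points on two great circular arcs leaving (1, 0, 0) in directions `bend` apart."""
    directions = ((1.0, 0.0), (math.cos(bend), math.sin(bend)))
    points = []
    for (d1, d2), angle in zip(directions, (angle1, angle2)):
        for i in range(pieces + 1):
            a = angle * i / pieces
            points.append((math.cos(a), math.sin(a) * d1, math.sin(a) * d2))
    return points


def tube_measure_mc(points, r, n_draws=20_000, seed=1):
    """Monte Carlo estimate of the uniform measure of {u: max_v u'v >= r}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        x = [rng.gauss(0.0, 1.0) for _ in range(3)]
        n = math.sqrt(sum(c * c for c in x))
        u = [c / n for c in x]
        if max(u[0] * p[0] + u[1] * p[1] + u[2] * p[2] for p in points) >= r:
            hits += 1
    return hits / n_draws


if __name__ == "__main__":
    bound = tube_bound_k3(0.9, 0.5 + 0.7)
    for bend in (math.pi, math.pi / 2.0):
        print(bend, tube_measure_mc(bent_path(0.5, 0.7, bend), 0.9), bound)
```

With `bend = math.pi` the two arcs lie on one great circle and form a single arc of length 1.2, so the estimate should be close to the bound; with a sharper bend the tube overlaps itself near the joint and the estimate should fall below the bound.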


REMARK 3.5. A result due to Hotelling (1939) gives equality in Theorem 3.1
under two conditions. The first condition is that there be no “local self-overlapping”
of the “tube” $\Gamma(\gamma)_{(r)}$, which amounts to the condition that the path be
twice differentiable and that the radius of curvature of the path at each point be
at least $(1 - r^2)^{1/2}$. The second condition is that there be no global overlapping
of the tube, which occurs when points lie within parts of the tube corresponding
to nonneighboring arcs of $\Gamma(\gamma)$.

A generalization of Hotelling’s result due to Weyl (1939) gives an exact
expression for $\mu(T_{(r)})$, for $T$ a manifold contained in $S^{k-1}$, when $r$ is sufficiently
large and no global overlap occurs.


4. Conservative confidence bands. We now apply Theorem 3.1 to define 
conservative confidence bands in the context of Section 1. 


THEOREM 4.1. The following is a lower bound for the coverage probability of
the band (1.2),

$1 - \int_0^{1/c} \min\{F_{k-2,2}[2((ct)^{-2} - 1)/(k - 2)] \times \Lambda(\gamma)/\pi + F_{k-1,1}[((ct)^{-2} - 1)/(k - 1)],\ 1\}\, f_T(t)\,dt,$

where $f_T$ denotes the density function of $T$, a random variable such that
$kT^2 \sim F_{\nu,k}$. Thus, if $c$ is such that the above expression is at least $1 - \alpha \in (0,1)$
then the band (1.2) is a $100(1 - \alpha)\%$ conservative confidence band.


PROOF. Since $\mu\{T \cup -T\} \le 2\mu(T)$ the result follows from Lemma 2.1 and
Theorem 3.1. □
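
The lower bound can be evaluated numerically. The following Python sketch, which is only an illustration under stated assumptions, treats the case $k = 3$, where the integrand reduces to $\min\{(1 - (ct)^2)^{1/2}\Lambda(\gamma)/\pi + (1 - ct),\ 1\}$, and replaces the integral by a Monte Carlo average over draws of $T = s/\|B\|$ with $\sigma = 1$. The function name, the sample size and the use of simulation in place of quadrature are all assumptions.

```python
import math
import random


def coverage_lower_bound_k3(c, length, nu, n_draws=50_000, seed=2):
    """Monte Carlo estimate of the two-sided coverage lower bound when k = 3."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # s^2 ~ chi^2_nu / nu and ||B||^2 ~ chi^2_3, so that 3 T^2 ~ F_{nu,3}.
        s2 = rng.gammavariate(nu / 2.0, 2.0) / nu
        b2 = rng.gammavariate(1.5, 2.0)
        r = c * math.sqrt(s2 / b2)  # r = cT
        if r <= 1.0:
            tube = math.sqrt(1.0 - r * r) * length / math.pi + (1.0 - r)
            total += min(tube, 1.0)
    return 1.0 - total / n_draws


if __name__ == "__main__":
    for c in (2.0, 3.0, 4.0):
        print(c, coverage_lower_bound_k3(c, length=2.0, nu=10))
```

Because the same seed gives the same draws of $T$ for every $c$, the estimate is nondecreasing in $c$, so a conservative critical point could be located by bisection on $c$ until the estimate reaches $1 - \alpha$.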


Some applications call for a one-sided confidence band. For upper (or lower)
Scheffé-type bands, i.e., bands of the form

(4.1) $\hat b'f(x) + c\,s\,p(x)$ or $\hat b'f(x) - c\,s\,p(x)$ for $x \in I$,

a proof similar to the proof of Lemma 2.1 yields the following expression for the
coverage probability,

$1 - \int_0^{1/c} \mu\{\Gamma(\gamma)_{(ct)}\}\, f_T(t)\,dt,$

and this leads to the following result.


THEOREM 4.2. Under the assumptions given in Section 1 the following
expression is a lower bound for the coverage probability of the upper (or lower)
simultaneous confidence band (4.1),

$1 - \int_0^{1/c} \min\{F_{k-2,2}[2((ct)^{-2} - 1)/(k - 2)] \times \Lambda(\gamma)/(2\pi) + F_{k-1,1}[((ct)^{-2} - 1)/(k - 1)]/2,\ 1\}\, f_T(t)\,dt.$

Thus, if $c$ is such that the above expression is at least $1 - \alpha \in (0,1)$ then the
band (4.1) is a $100(1 - \alpha)\%$ conservative confidence band.


REMARK 4.1. Theorems 4.1 and 4.2 are easily seen to give strict improvements
over the Scheffé method, for which the critical constant in (1.2) is given by
$c_S = \{kF_{k,\nu,\alpha}\}^{1/2}$. This is because the integrands in Theorems 4.1 and 4.2 are
bounded by $f_T(t)$, with strict inequality for values of $t$ sufficiently close to $1/c$.
See Section 5 for numerical comparisons in the case of quadratic regression.


5. Example. Theorem 4.1 can be used to construct conservative two-sided
confidence bands for quadratic regression over $I$, an interval subset of $R$. In this
section we compare bands constructed using this method to those constructed by
other methods. The results are summarized in Table 1, which gives ratios of
critical points, or equivalently, ratios of band widths.

For quadratic regression in (1.1) we take $k = 3$, and $f_j(x) = x^{j-1}$ for $j = 1, 2, 3$.
For the case when $I = R$, Wynn and Bloomfield (1971) (Section 3.4) show that
the image of the path in (2.1) is the intersection of the cone

(5.1) $(\delta/(1 + \delta))x_1^2 + (1/(1 + \delta))x_2^2 = x_3^2$

in $R^3$ with the unit sphere $S^2$, where $\delta \in [0,1]$ is a constant depending on the
design. They tabulate the constant $c_{WB}$ for which the band (1.2) has an exactly
prescribed coverage probability for $\delta = 0$, 0.5, and 1.

For the purpose of comparing band widths, Table 1 gives ratios $c/c_{WB}$, where
$c$ is the constant obtained using Theorem 4.1. Note that the ratios are fairly close
to unity. This is in spite of the fact that we should not expect the inequality in
Theorem 4.1 to be very sharp when $\Gamma(\gamma)$ forms a closed curve, due to the fact
that the $F_{k-1,1}$ term in Theorem 4.1 is unnecessary when the Hotelling (1939)
result applies. The Scheffé (1953, 1959) method can be used to give simultaneous
confidence intervals for all linear combinations of the three unknown parameters,
and restriction to quadratic regression leads to a conservative confidence band.
Table 1 gives the ratio $c/c_S$, where $c_S = \{3F_{3,\nu,\alpha}\}^{1/2}$, the critical point for the
Scheffé method.


TABLE 1
Ratios of critical points for quadratic regression

  α      δ      ν     c/c_WB    c/c_S    c_0/c_WB
 0.01   0.0     10     1.026    0.996     0.66
        0.0     +∞     1.027    0.996     0.68
        0.5     10     1.046    0.980     0.65
        0.5     +∞     1.047    0.980     0.67
        1.0     10     1.015    0.951     0.67
        1.0     +∞     1.016    0.951     0.69
 0.05   0.0     10     1.025    0.964     0.69
        0.0     +∞     1.023    0.964     0.73
        0.5     10     1.045    0.979     0.68
        0.5     +∞     1.042    0.979     0.72
        1.0     10     1.014    0.950     0.70
        1.0     +∞     1.013    0.951     0.74
 0.10   0.0     10     1.022    0.967     0.74
        0.0     +∞     1.015    0.967     0.79
        0.5     10     1.043    0.982     0.73
        0.5     +∞     1.036    0.981     0.78
        1.0     10     1.014    0.954     0.75

For the case when $I$ is a proper subset of $R$, use of $c_{WB}$ leads to conservative
bands and for sufficiently small intervals, one would expect to obtain narrower
bands by using Theorem 4.1. In order to indicate the greatest potential improvement
we give the limiting ratio $c/c_{WB}$, as $I$ shrinks to a point. Clearly, this
equals $c_0/c_{WB}$, where $c_0$ is the critical point obtained by using Theorem 4.1 with
$\Lambda(\gamma) = 0$. Note that use of Theorem 4.1 can lead to considerable savings in band
width.

In preliminary calculations for compiling Table 1, little variation was found in
ratios of critical points as a function of $\nu$, for $\nu = 10$, 20, 40, and $+\infty$. For this
reason, only the values for $\nu = 10$ and $+\infty$ were included.




Acknowledgments. I am grateful to Esa Uusipaikka for showing me Lemma
2.1. I wish to thank Søren Johansen for the references to Hotelling (1939) and
Weyl (1939), and for his many suggestions on writing this paper. I would also like
to thank the referees for their many comments which led to a much improved
version of this paper.


REFERENCES 


CASELLA, G. and STRAWDERMAN, W. E. (1980). Confidence bands for linear regression with restricted
predictor variables. J. Amer. Statist. Assoc. 75 862-868.

HOTELLING, H. (1939). Tubes and spheres in n-spaces, and a class of statistical problems. Amer. J.
Math. 61 440-460.

KNAFL, G., SACKS, J. and YLVISAKER, D. (1985). Confidence bands for regression functions. J. Amer.
Statist. Assoc. 80 683-691.

NAIMAN, D. Q. (1984). Simultaneous confidence bounds for multiple regression functions over regions
defined by constraints on the predictor variables. Unpublished manuscript.

SCHEFFÉ, H. (1953). A method for judging all contrasts in the analysis of variance. Biometrika 40
87-104.

SCHEFFÉ, H. (1959). The Analysis of Variance. Wiley, New York.

UUSIPAIKKA, E. (1984). Exact confidence bands on certain restricted sets in the general linear model.
Unpublished manuscript.

WEYL, H. (1939). On the volume of tubes. Amer. J. Math. 61 461-472.

WYNN, H. P. (1984). An exact confidence band for one-dimensional polynomial regression.
Biometrika 71 375-380.

WYNN, H. P. and BLOOMFIELD, P. (1971). Simultaneous confidence bounds in regression analysis. J.
Roy. Statist. Soc. Ser. B 33 202-217.


DEPARTMENT OF MATHEMATICAL SCIENCES 
THE JOHNS HOPKINS UNIVERSITY 
BALTIMORE, MARYLAND 21218 


The Annals of Statistics 
1986, Vol. 14, No. 3, 907-924


SPHERICAL REGRESSION¹


By TED CHANG 
Simon Fraser University and University of Kansas 


Suppose $u_1, \ldots, u_n$ are fixed points on the sphere, $v_1, \ldots, v_n$ are random
points such that the distribution of $v_i$ depends only upon $v_i^t A u_i$ for some
unknown rotation $A$. This paper provides asymptotic tests and confidence
regions for $A$ and for the axis of rotation of $A$. Results are given in arbitrary
dimension.


Let $S^p$ be the unit radius sphere in $p$-dimensional Euclidean space and let
$SO(p)$ be the $p \times p$ orthogonal matrices (that is, matrices $A$ such that $AA^t = I$)
of determinant 1. We consider in this paper “spherical regression” problems on
the following model: $u_1, \ldots, u_n$ are fixed points in $S^p$ (written as column vectors),
$v_1, \ldots, v_n$ are random points in $S^p$ such that $v_1, \ldots, v_n$ are independent and such
that the density of $v_i$, with respect to uniform measure on $S^p$, is of the form
$g(v_i^t A u_i)$ for some unknown $A$ in $SO(p)$. We want to develop statistical
procedures for estimating and testing the unknown parameter $A$.

The case of the circle ($p = 2$) is essentially well known because $A$ is
counterclockwise rotation by an unknown angle $\theta$. If $\theta_i$ is the angle from $u_i$ to $v_i$ then
$\theta_1, \ldots, \theta_n$ are independent and identically distributed with a density of the form
$g(\theta_i - \theta)$.

The case of the sphere (p = 3) is of considerable practical importance. The 
following two problems are abstractions of problems proposed to the author by 
workers in other fields; the first from geology and the second from petroleum 
exploration. It was the simultaneous and fortuitous presentation of these
problems that led to the present study.


PROBLEM 1. A rigid body, confined to the surface of the earth, has moved in
an unknown manner. For certain points ($u_i$) on $S^3$, estimates of past position at a
fixed point in time ($v_i$) are available. What was its previous position?


In this problem the body’s past position relative to its present position is
determined uniquely by an element $A$ of $SO(3)$. The $v_i$ are estimates of $Au_i$ and
the problem is to determine $A$.


PROBLEM 2. The directions ($v_i$) of certain signals have been measured in an
unknown coordinate system. The directions ($u_i$) of the same signals in a known
coordinate system can be calculated. What is the unknown coordinate system?


Received August 1984; revised December 1985. 

¹ Listings of FORTRAN programs implementing in three dimensions the procedures outlined in this
paper are available from the author. They are available in single precision using the IMSL library or
in double precision using the NAG library.

AMS 1980 subject classification. Primary 62J99.

Key words and phrases. Estimated rotations on spheres. 




908 T. CHANG 


In this problem if the rows of $A$ are the components of the coordinate axes of
the unknown coordinate system with respect to the known one, the $v_i$ are
measurements of $Au_i$ with error and the problem is to determine $A$.

Variations on Problem 1 are of especial interest in the study of plate tectonics.
Geophysicists have been fitting rotations to the motion of tectonic plates for 20
years. Only some of the data they use can be modeled in the form of Problem 1.
The approach has been to define an error sum of squares SSE($A$) which depends
upon the choice of a candidate rotation $A$, to iteratively minimize SSE($A$), thus
arriving at an estimate $\hat A$ of the unknown rotation $A$, and to assume an
approximating distribution for

$\dfrac{\mathrm{SSE}(A) - \mathrm{SSE}(\hat A)}{\mathrm{SSE}(\hat A)}.$

Examples of this procedure can be found in Le Pichon (1968, 1973), Chase
(1972) and Engebretson, Cox and Gordon (1984). No attempt is made to prove the
correctness of the assumed asymptotic distribution. The author has found that
for the choice of error sum of squares studied in this paper, the asymptotic
distribution is not $2n\chi^2(3)$ as one might assume. Nevertheless, if the error
distribution is concentrated, as those in plate tectonics seem to be, the true
asymptotic distribution is in fact extremely close to $2n\chi^2(3)$. The author hopes
that this paper can be a start towards a more rigorous and mathematical
understanding of these problems.

If $c_0 = E(v_i^t A u_i) > 0$, it is reasonable to estimate $A$ by the matrix $\hat A$ which
minimizes

$\sum_i \|v_i - Au_i\|^2 = 2n - 2\sum_i v_i^t A u_i.$

Letting $U_n$ and $V_n$ be the $p \times n$ matrices whose columns are $u_i$ and $v_i$
respectively, the solution for $\hat A$ was found by MacKenzie (1957) and Stephens
(1979). It is readily computable from a modified singular value decomposition

$U_n V_n^t = O_1 \Lambda O_2^t$

where $O_1, O_2 \in SO(p)$ and $\Lambda$ is diagonal with entries $\lambda_1, \ldots, \lambda_p$ satisfying
$\lambda_1 \ge \lambda_2 \ge \cdots \ge |\lambda_p|$. If the rank of $U_n$ is $p$, the determinant of $U_n V_n^t$ is nonzero
with probability 1 and in that case, $\hat A$ is uniquely given by $O_2 O_1^t$. We will call $\hat A$
the “least squares estimate of $A$.” In this paper we will find the asymptotic
distribution of $\hat A$ under the assumption that $(1/n)U_n U_n^t$ converges as $n \to \infty$ to a
positive definite symmetric matrix $\Sigma$ (Theorem 1). We propose that asymptotic
confidence regions for $A$ be based upon Theorem 1.
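
The computation of the least squares estimate from the modified singular value decomposition can be sketched in Python as follows. This is an illustrative sketch, not the author's FORTRAN implementation: it assumes NumPy is available, the function name is hypothetical, and the modification is carried out by flipping the sign of the last column of an orthogonal factor whenever its determinant is negative.

```python
import numpy as np


def least_squares_rotation(U, V):
    """Rotation maximizing sum_i v_i' A u_i for p x n arrays U, V of unit columns."""
    # Ordinary SVD: U V' = W diag(lam) Z', with W and Z orthogonal.
    W, lam, Zt = np.linalg.svd(U @ V.T)
    O1, O2 = W.copy(), Zt.T.copy()
    # Move each factor into SO(p). Flipping the last column of one factor
    # changes the sign of the smallest diagonal entry of the middle matrix.
    if np.linalg.det(O1) < 0:
        O1[:, -1] *= -1
    if np.linalg.det(O2) < 0:
        O2[:, -1] *= -1
    return O2 @ O1.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U = rng.normal(size=(3, 50))
    U /= np.linalg.norm(U, axis=0)
    theta = 0.3
    A = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
    V = A @ U + 0.05 * rng.normal(size=(3, 50))  # noisy images of the u_i
    V /= np.linalg.norm(V, axis=0)
    print(least_squares_rotation(U, V))
```

The returned matrix always has determinant $+1$, which is what distinguishes the estimate over $SO(p)$ from the estimate over $O(p)$.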

Letting $O(p)$ denote the $p \times p$ orthogonal matrices, we define, following
Stephens (1979), for a subset $S$ of $O(p)$ the vector correlation $r(S)$ by

$r(S) = \sup_{A \in S} \frac{1}{n} \sum_i v_i^t A u_i.$

Stephens studied the distribution of $r(SO(p))$ and $r(O(p))$ when the $u_i$ and $v_i$
are independently and uniformly distributed on the sphere $S^p$. Using Theorem 1,

SPHERICAL REGRESSION 909 


we will find for closed subgroups $G' \subset G$ of $O(p)$, the asymptotic distribution of
$r(G')$ and of $r(G) - r(G')$ when $A \in G'$ (Theorem 2).

With $G = SO(p)$ and $G' = \{I\}$, we propose to use Theorem 2 to test whether
$A$ is some specified $A_0$ or not. The resulting test is based upon the test statistic
$r(SO(p)) - r(A_0)$ (where, abusing the notation, we write $r(A_0)$ for $r(\{A_0\})$).

When $p = 3$, the matrix $A$ will represent rotation of an angle $\theta$ about an axis
$\xi$. In both of the problems cited above it is of interest to test if $\xi$ is some
predetermined $\xi_0$. If we let $G'$ be the group (isomorphic to $SO(2)$) of rotations
around $\xi_0$, we are testing the hypothesis $A \in G'$. We propose to base an
asymptotic test on Theorem 2 and on the test statistic $r(SO(3)) - r(G')$.

Gould (1969) considered another inequivalent type of spherical regression
model. For the sphere $S^3$, the Gould model is that the $v_i$ are independently
Fisher distributed with modal vector $u_i = (\cos\phi_i, \sin\phi_i\cos\theta_i, \sin\phi_i\sin\theta_i)$ with $\phi_i$
and $\theta_i$ known linear functions of the unknown parameters. Gould also considers a
similar model on the circle $S^2$.

In Section 1, we state and prove Theorems 1 and 2 in arbitrary dimensions. In 
Section 2, we describe asymptotic hypotheses tests with special attention to three 
dimensions. Section 3 contains a numerical example and Section 4 discusses 
display of confidence regions for A. 

If the underlying distribution is Fisher, $d(\kappa)\exp(\kappa v^t A u)$, the procedures in this
paper are just maximum likelihood estimation and likelihood ratio testing. For
other distributions, the author believes that use of the least squares estimate $\hat A$ is
justified by the relative ease of computing $\hat A$.

In this paper, the $u_i$ play the role of the predictor variables in linear
regression: They are assumed fixed and $v_i$ is assumed to have a rotationally
symmetric distribution centered at $Au_i$. If they are instead random but with a
distribution independent of $A$, the results are still valid for inference conditional
on the $u_i$. When the distribution of $u$ and the conditional distribution of $v$ are
both Fisher, aspects of this problem were studied by Rivest (1984).


1. Statements and proofs of the main theorems. We will think of
Euclidean $p^2$ space $R^{p^2}$ as the collection of $p \times p$ matrices with the usual inner
product $\langle A, B\rangle = \mathrm{tr}(AB^t)$. Let $O(p) \subset R^{p^2}$ be those matrices $A$ such that
$AA^t = I$. Then $O(p)$ is closed (in the usual metric space sense), has dimension
$\frac{1}{2}p(p - 1)$ and consists of two connected components. One of these is $SO(p)$,
the matrices in $O(p)$ of determinant $+1$, and the other consists of the elements
of $SO(p)$ followed by any reflection.

The tangent space at the identity $I$ of $O(p)$ (and hence of $SO(p)$) is the
collection of skew-symmetric $p \times p$ matrices; that is, the matrices $H$ such that
$H + H^t = 0$. We denote the collection of such $H$ by $L(SO(p))$.

The exponential map $\Phi: L(SO(p)) \to SO(p)$ is defined by

$\Phi(H) = I + H + \dfrac{H^2}{2!} + \dfrac{H^3}{3!} + \cdots.$

If $G$ is a closed subgroup of $O(p)$ with the metric space topology, we let $L(G)$
denote the tangent space at $I$ of $G$. $L(G)$ is by definition a vector subspace
of $L(SO(p))$. It can be shown (see Theorem 15 and its proof, Spivak (1979),
page 530) that $L(G)$ is the set of $H$ in $L(SO(p))$ such that $\exp tH$ is in $G$ for all
real $t$. The dimension of $G$ is the dimension of $L(G)$. If $G = SO(p)$ or $O(p)$,
$\dim G = \frac{1}{2}p(p - 1)$.

Let $\hat A_n(G)$ be the “least squares estimate of $A$ in $G$,” namely the element of $G$
which maximizes

$\dfrac{1}{n}\sum_{i=1}^{n} v_i^t A u_i = \mathrm{tr}\left(A\,\dfrac{U_n V_n^t}{n}\right)$

as $A$ varies over $G$. Thus $\hat A_n(O(p))$ is the statistic defined by MacKenzie (1957)
and $\hat A_n(SO(p))$ is Stephens’s (1979) modification of MacKenzie’s statistic.

If $v \in S^p$ has density of the form $g(v^t u)$ for some $u \in S^p$, we define constants
$c_0$, $c_1$, and $c_2$ by

$E(v) = c_0 u,$
$E[(v - c_0 u)(v - c_0 u)^t] = c_1 uu^t + c_2 I.$

That $E(v)$ is a multiple of $u$ is obvious by symmetry. That
$E[(v - c_0 u)(v - c_0 u)^t]$ can be written in the form $c_1 uu^t + c_2 I$ is obvious when $u$ is the “north
pole” and follows for general $u$ by rotating $S^p$.


THEOREM 1. Let $G$ be a closed subgroup of $O(p)$. Suppose each $v_i$ has a
density $g(v_i^t A_0 u_i)$ where $A_0$ is in $G$. Suppose furthermore $c_0 > 0$ and that
$\frac{1}{n}\sum_i u_i u_i^t$ converges to a positive definite symmetric matrix $\Sigma$. Then

(a) $\hat A_n(G)$ is consistent for $A_0$.

(b) Write $A_0^t \hat A_n(G) = \phi(H_n)$ for $H_n \in L(G)$. Then $H_n$ is asymptotically
multivariate normal with mean 0 and density (with respect to a Lebesgue measure on
$L(G)$) proportional to

$$\exp\left[\frac{n c_0^2}{2 c_2}\operatorname{tr}(H \Sigma H)\right].$$

Thus $-(n c_0^2/c_2)\operatorname{tr}(H_n \Sigma H_n)$ is asymptotically $\chi^2(\dim G)$.
Most of the proof of Theorem 1 is a mimic of the proofs in the asymptotic
theory of the mle with the log likelihood function replaced by $\operatorname{tr}(A(U_n V_n^t)/n)$ and
with nonidentically distributed variates. We will therefore omit many details.
Let $U_n = [u_1 \cdots u_n]$, $V_n = [v_1 \cdots v_n]$ and $X_n = \frac{1}{n} U_n V_n^t$.


LEMMA 1. $X_n \to c_0 \Sigma A_0^t$ (strong convergence).


PROOF. Let $W_i = u_i v_i^t - c_0 u_i u_i^t A_0^t$. $W_i$ is a $p \times p$ matrix with expected value
0. By Kolmogorov's criterion for the strong law of large numbers (see Billingsley
(1979), page 250), $\frac{1}{n}\sum_{i=1}^{n} W_i$ converges to 0 with probability 1. The lemma
follows. □


LEMMA 2. $\hat A_n(G)$ is strongly consistent for $A_0$.


PROOF. $\hat A_n(G)$ maximizes $\operatorname{tr}(AX_n)$ as $A$ varies over $G$. By Lemma 1,
$X_n \to c_0 \Sigma A_0^t$ with probability 1. Since $\Sigma$ is positive definite and $c_0 > 0$,

$$\operatorname{tr}(A c_0 \Sigma A_0^t) = c_0 \operatorname{tr}(A_0^t A \Sigma)$$

is maximized uniquely when $A_0^t A = I$ or $A = A_0$. The lemma follows from the
following observation:

Suppose $f$ is a continuous function on $\mathcal{X} \times \mathcal{Y}$ with $\mathcal{X}$ compact and suppose
furthermore that for a specific $y_0 \in \mathcal{Y}$, $f(x, y_0)$ has a unique maximum at
$x = x_0$. Suppose $y_n \to y_0$ and each $x_n$ is a choice of a maximum for $f(x, y_n)$. Then
$x_n \to x_0$. □


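Since $\hat A_n(G) \to A_0$, for large enough $n$ we can write $A_0^t \hat A_n(G) = \phi(H_n)$ where
$H_n \in L(G)$ is chosen to have smallest magnitude. By replacing $v_i$ with $A_0^t v_i$ we
can assume $A_0 = I$. Pick a specific $B \in L(G)$ and define a real valued function on
$L(G)$

$$g_n^B(H) = \frac{d}{dt}\Big|_{t=0} \operatorname{tr}(\phi(H + tB)X_n).$$

We have $g_n^B(H_n) = 0$. We expand $g_n^B$ in a Taylor series around 0:

$$g_n^B(0) = \frac{d}{dt}\Big|_{t=0} \operatorname{tr}(\phi(tB)X_n) = \operatorname{tr}(BX_n).$$

If $H \in L(G)$,

$$(dg_n^B)(0)H = \frac{d}{ds}\Big|_{s=0}\frac{d}{dt}\Big|_{t=0} \operatorname{tr}(\phi(sH + tB)X_n)
= \operatorname{tr}\left(\frac{HB + BH}{2}X_n\right).$$

Thus

$$g_n^B(H) = \operatorname{tr}(BX_n) + \operatorname{tr}\left(\frac{HB + BH}{2}X_n\right) + R.$$

Defining for a matrix $H$ the ordinary Euclidean metric

$$\|H\|^2 = \operatorname{tr}(HH^t),$$

it can be easily shown that $|R| \le \|H\|^2 \|B\| e^{\|H\|}$. Since $g_n^B(H_n) = 0$, and since

$$\operatorname{tr}(H\Sigma B) = \operatorname{tr}(B^t \Sigma^t H^t) = \operatorname{tr}(B\Sigma H),$$

the following lemma is obtained:

LEMMA 3. For $B \in L(G)$,

$$-\operatorname{tr}(B\sqrt{n}X_n) = c_0 \operatorname{tr}(\sqrt{n}H_n \Sigma B) + R_n$$

where

$$|R_n| \le \sqrt{n}\,\|H_n\|\,\|B\|\left(\|H_n\| e^{\|H_n\|} + \|X_n - c_0\Sigma\|\right).$$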




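Let $L(G)^*$ be the dual space to $L(G)$. Define $\alpha_n \in L(G)^*$ by
$\alpha_n(B) = -\operatorname{tr}(B\sqrt{n}X_n)$. Each $\alpha_n$ is a random variable with values in $L(G)^*$.


LEMMA 4. $\alpha_n$ has a limiting multivariate normal distribution with covariance
quadratic form $c_2 Q(B_1, B_2) = -c_2 \operatorname{tr}(B_1 \Sigma B_2)$ for $B_1, B_2 \in L(G)$.


PROOF. To say that the random vector $\alpha$ in $L(G)^*$ has covariance quadratic
form $c_2 Q$ means that if $B_1$, $B_2$ are nonrandom vectors in $L(G)$ the covariance of
the real valued random variables $\alpha(B_1)$ and $\alpha(B_2)$ is $c_2 Q(B_1, B_2)$.

The characteristic function of $\alpha_n$ is

$$F_n(B) = E\left[\exp\left(\sqrt{-1}\,\alpha_n(B)\right)\right], \qquad B \in L(G).$$

Substituting $X_n = \frac{1}{n}\sum_i u_i v_i^t$ and noting that

$$0 = \operatorname{tr}[B u_i u_i^t] = u_i^t B u_i,$$

since $B$ is antisymmetric and $u_i u_i^t$ is symmetric,

$$F_n(B) = \prod_{i=1}^{n} E\left[\exp\left(\frac{-\sqrt{-1}}{\sqrt{n}}(v_i - c_0 u_i)^t B u_i\right)\right].$$

Since $E(v_i - c_0 u_i) = 0$ and

$$E\left[(Bu_i)^t (v_i - c_0 u_i)(v_i - c_0 u_i)^t Bu_i\right] = u_i^t B^t (c_1 u_i u_i^t + c_2 I) B u_i$$
$$= -c_1 (u_i^t B u_i)^2 - c_2 (u_i^t BB u_i)$$
$$= -c_2 \operatorname{tr}(B u_i u_i^t B),$$

we have

$$F_n(B) = \prod_{i=1}^{n}\left[1 + \frac{c_2}{2n}\operatorname{tr}(B u_i u_i^t B) + o\left(\frac{\|B\|^2}{n}\right)\right].$$

The remainder $o(\|B\|^2/n)$ is bounded uniformly in $i$ by
$\min\{\|B\|^3/6n^{3/2}, \|B\|^2/n\}$ (see Billingsley (1979), equation (26.5)), and hence as $n \to \infty$

$$F_n(B) \to \exp\left(\frac{c_2}{2}\operatorname{tr}(B\Sigma B)\right). \qquad \Box$$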

Now let $\rho: L(G) \to L(G)^*$ be

$$\rho(H)B = Q(H, B) = -\operatorname{tr}(H\Sigma B).$$

Since $\Sigma$ is positive definite, $\rho(H)H > 0$ for $H \ne 0$ and $\rho$ is nonsingular. Lemmas 3 and 4
imply that

$$\alpha_n = \rho(-c_0\sqrt{n}H_n) + o_p(1),$$

and hence $\rho(-(c_0/\sqrt{c_2})\sqrt{n}H_n)$ has a limiting multivariate normal distribution
with covariance quadratic form $Q$.


Now $Q$ defines an identification of $L(G)$ with $L(G)^*$ and this identification is
$\rho$. It follows that $H_n$ has an asymptotic multivariate normal distribution with a
density proportional to

$$\exp\left[-\frac{1}{2}Q\left(-\frac{c_0}{\sqrt{c_2}}\sqrt{n}H_n,\; -\frac{c_0}{\sqrt{c_2}}\sqrt{n}H_n\right)\right]
= \exp\left[\frac{n c_0^2}{2 c_2}\operatorname{tr}(H_n \Sigma H_n)\right].$$


[Let $X^*$ be a random vector with values in a dual space $\mathcal{Y}^*$ and let the
quadratic form $Q$ on $\mathcal{Y}$ be the covariance of $X^*$. Let $X$ be a random vector
in $\mathcal{Y}$ defined by $Q(X, B) = X^*(B)$ for all $B \in \mathcal{Y}$. If we pick a basis $e_1, \ldots, e_p$
of $\mathcal{Y}$ and write $X = \sum_i x_i e_i$, let $V$ be the matrix $\operatorname{cov}(x_i, x_j)$. Then
$Q(X, X) = [x_1 \cdots x_p]\, V^{-1}\, [x_1 \cdots x_p]^t$.] This proves Theorem 1.


THEOREM 2. (a) If $A_0 \in G$, then $r(G)$ has a limiting normal distribution
with mean $c_0$ and variance $(c_1 + c_2)/n$.
(b) If $A_0 \in H \subset G$, then $(2nc_0/c_2)(r(G) - r(H))$ has a limiting
$\chi^2(\dim G - \dim H)$ distribution.
(c) If $A_0 \in K \subset H \subset G$, then

$$\frac{\dim G - \dim H}{\dim H - \dim K}\cdot\frac{r(H) - r(K)}{r(G) - r(H)}$$

is asymptotically $F(\dim H - \dim K, \dim G - \dim H)$.


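PROOF OF (a). With the notation of Theorem 1,

$$r(G) = \operatorname{tr}\left(A_0\,\phi(H_n)\,\frac{U_n V_n^t}{n}\right).$$

As before, we can assume $A_0 = I$. Then

$$\sqrt{n}(r(G) - c_0) = \sqrt{n}\left[\operatorname{tr}\left((I + H_n)\frac{U_n V_n^t}{n}\right) - c_0\right] + o_p(1)$$
$$= \sqrt{n}\operatorname{tr}\left[\frac{U_n V_n^t}{n} - c_0\frac{U_n U_n^t}{n}\right] + \sqrt{n}\,c_0 \operatorname{tr}[H_n \Sigma] + o_p(1)$$
$$= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(v_i - c_0 u_i)^t u_i + o_p(1).$$

The summands are identically distributed with mean 0 and variance

$$E[u^t (v - c_0 u)(v - c_0 u)^t u] = u^t (c_1 uu^t + c_2 I)u = c_1 + c_2. \qquad \Box$$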
PROOF OF (b). Let $Q(B_1, B_2) = -\operatorname{tr}(B_1 \Sigma B_2)$. We define $H_n(G)$ by
$\hat A_n(G) = A_0\phi(H_n(G))$ and similarly $H_n(H)$ and $H_n(K)$, and again we set $A_0 = I$. If
$B \in L(H) \subset L(G)$ then using Lemma 3 with both $H_n(G)$ and $H_n(H)$, we see
that $Q(\sqrt{n}(H_n(G) - H_n(H)), B)$ is $o_p(\|B\|)$. Thus if $\beta_n$ is the projection under $Q$
of $\sqrt{n}H_n(G)$ to the perpendicular complement of $L(H)$ in $L(G)$,

$$\sqrt{n}(H_n(G) - H_n(H)) = \beta_n + o_p(1).$$

Thus, using Lemma 3,

$$2n(r(G) - r(H)) = 2n(r(G) - r(I)) - 2n(r(H) - r(I))$$
$$= c_0 Q(\sqrt{n}H_n(G), \sqrt{n}H_n(G)) - c_0 Q(\sqrt{n}H_n(H), \sqrt{n}H_n(H)) + o_p(1)$$
$$= c_0 Q(\beta_n, \beta_n) + o_p(1).$$

Thus $(2nc_0/c_2)(r(G) - r(H))$ is asymptotically $\chi^2(\dim G - \dim H)$. □
PROOF OF (c). From part (b) we see that up to terms $o_p(1)$, $\sqrt{n}H_n(H)$ is the
projection under $Q$ of $\sqrt{n}H_n(G)$ to $L(H)$ and that $\sqrt{n}(H_n(H) - H_n(K))$ is its
projection to the orthogonal complement of $L(K)$ in $L(H)$. Part (c) follows. □


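THEOREM 3. Let $A_0^t \hat A_n(G) = \phi(H_n(G))$ for $H_n(G) \in L(G)$. Then
$\sqrt{n}(r(G) - c_0)$ and $\sqrt{n}H_n(G)$ are asymptotically independent.

PROOF. Using the proof of Theorem 2(a),

$$\sqrt{n}(r(G) - c_0) = \sqrt{n}(\operatorname{tr}(X_n) - c_0) + o_p(1).$$

From Lemma 3, for $B \in L(G)$,

$$\alpha_n(B) = -\operatorname{tr}(B\sqrt{n}X_n) = c_0 \operatorname{tr}(\sqrt{n}H_n(G)\Sigma B) + o_p(1).$$

Let

$$F_n(t, B) = E\left[\exp\sqrt{-1}\left(\alpha_n(B) + t\sqrt{n}(\operatorname{tr}(X_n) - c_0)\right)\right]$$

be the joint characteristic function of $\alpha_n$ and $\sqrt{n}(\operatorname{tr}(X_n) - c_0)$. Using a proof
similar to Lemma 4,

$$F_n(t, B) \to \exp\left(\frac{c_2}{2}\operatorname{tr}(B\Sigma B) - \frac{t^2}{2}(c_1 + c_2)\right)$$

and the theorem follows. □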
If the density $g$ is unknown, to use Theorems 1 and 2, we need consistent
estimates of $c_0$ and $c_2$. Using Theorem 2(a), we can estimate $c_0$ consistently by
$\hat c_0 = r(G)$ if $A_0 \in G$.


The following proposition provides a consistent estimator $\hat c_2$ of $c_2$. Using
Problem 29.4 and Theorem 29.2 of Billingsley (1979) it follows that Theorems 1,
2(b) and 2(c) are still valid if $c_0$ is replaced by $\hat c_0$ and $c_2$ is replaced by $\hat c_2$.




PROPOSITION 1. If $A_0 \in G$ and

$$\hat c_2 = \frac{1}{p - 1}\left[1 - \frac{1}{n}\sum_i \left(v_i^t \hat A_n(G) u_i\right)^2\right],$$

then $\hat c_2 \to c_2$ in probability.

LEMMA 5. $1 = c_0^2 + c_1 + p c_2$.

PROOF.

$$E vv^t = E[(v - c_0 Au)(v - c_0 Au)^t] + c_0^2 Auu^tA^t$$
$$= (c_0^2 + c_1)Auu^tA^t + c_2 I.$$

Taking the trace of both sides, we get the lemma. □


PROOF OF THE PROPOSITION. Setting as usual $A_0 = I$, we get $\hat A = \hat A_n(G) =
I + o_p(1)$. Therefore

$$\frac{1}{n}\sum_i (v_i^t \hat A u_i)^2 = \frac{1}{n}\sum_i (v_i^t u_i)^2 + o_p(1).$$

The right-hand side converges in distribution to $c_0^2 + c_1 + c_2$. Using the lemma,
the proposition follows. □


We now consider models of the form $d(\kappa)g(\kappa v^tAu)$ where the concentration
parameter $\kappa$ is unknown. If $c_0(\kappa) = E(v^tAu)$ is monotonic we can estimate $\kappa$
from the sample statistic $r(G)$ by solving

$$c_0(\hat\kappa) = r(G).$$

Theorem 2(a) can then be used for inferences on $\kappa$.
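As an illustration of solving $c_0(\hat\kappa) = r(G)$, the sketch below assumes the Fisher distribution with $p = 3$, for which $c_0(\kappa) = \coth\kappa - 1/\kappa$ is increasing in $\kappa$. The bisection routine, its bracketing interval and the function names are hypothetical choices for this example only.

```python
import math

def c0_fisher(kappa):
    """c_0(kappa) = coth(kappa) - 1/kappa for the Fisher distribution, p = 3."""
    return 1.0 / math.tanh(kappa) - 1.0 / kappa

def solve_kappa(r, lo=1e-3, hi=1e12, iterations=200):
    """Bisection on a log scale for c_0(kappa) = r, with c_0(lo) < r < 1."""
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)          # geometric midpoint of the bracket
        if c0_fisher(mid) < r:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Round trip: recover kappa = 5 from its own value of c_0.
kappa_hat = solve_kappa(c0_fisher(5.0))
```

Any monotone root finder would serve equally well; bisection is used here only because it needs no derivative.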


REMARK. One might wonder about the necessity of the requirement that $G$
be closed. Since the closure $\bar G$ of a subgroup $G$ is still a subgroup, and since
$r(G) = r(\bar G)$, Theorem 2 remains valid if $\dim G$ is always replaced by $\dim \bar G$.

Theorem 1, however, cannot be generalized to nonclosed subgroups. If $G$ is not
closed, the dimension of $G$ will always be strictly less than the dimension of $\bar G$.
Since $\hat A_n(G)$, if it exists, will also be $\hat A_n(\bar G)$, we expect $\hat A_n(G)$ to exist with
probability 0.


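An example of this pathology is the infamous "real line embedded in the
torus" which occurs in $SO(4)$. If $r$ is a fixed irrational number and

$$G = \left\{\begin{pmatrix}
\cos\theta & -\sin\theta & 0 & 0\\
\sin\theta & \cos\theta & 0 & 0\\
0 & 0 & \cos r\theta & -\sin r\theta\\
0 & 0 & \sin r\theta & \cos r\theta
\end{pmatrix} : \theta \text{ a real number}\right\}$$

then

$$\bar G = \left\{\begin{pmatrix}
\cos\theta_1 & -\sin\theta_1 & 0 & 0\\
\sin\theta_1 & \cos\theta_1 & 0 & 0\\
0 & 0 & \cos\theta_2 & -\sin\theta_2\\
0 & 0 & \sin\theta_2 & \cos\theta_2
\end{pmatrix}\right\}.$$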


The author believes that the generic pathological nature of the nonclosed 
subgroups indicates that they have no practical statistical interest. 


REMARK. If $c_0 = 0$, $\hat A_n$ might very well be inconsistent.


For example, if each $v_i$ is uniformly distributed on $S^p$ then for each $A \in O(p)$,
$v_1, \ldots, v_n$ and $Av_1, \ldots, Av_n$ are equally likely. If $\hat A_n(G)$ is the least squares fit for
$v_1, \ldots, v_n$, then $A\hat A_n(G)$ will be the least squares fit for $Av_1, \ldots, Av_n$. It follows
that the distribution of $\hat A_n(G)$ will be the unique left invariant Haar measure for
each $n$ and hence $\hat A_n(G)$ is inconsistent.

For this example, Stephens (1979) has studied the limiting distributions of
$r(SO(p))$ and $r(O(p))$ and we observe that Theorem 2 is also false.

If $c_0 < 0$ and if $(-I) \in G$, then $\hat A_n(G) \to -A_0$, and Theorems 1 and 2 could
be modified to handle this case. However, if $c_0 < 0$, it is intrinsically unreasonable
to study the $A$ which maximizes $\sum_i v_i^t A u_i$. A more reasonable approach would
be to maximize $\sum_i v_i^t A(-u_i)$ and if this were done, Theorems 1, 2, and 3 could still
be applied with minor changes.


2. Hypothesis tests. Suppose $H$ is a closed subgroup of $O(p)$. If $c_0$ and $c_2$
are known, we can use Theorem 2(b) to asymptotically test if the true orthogonal
matrix $A$ is in $H$.


EXAMPLE. Suppose we wish to test $H_0$: $A = A_0$. Then using Theorem 2(b)
with $H = \{I\}$ and each $u_i$ replaced by $A_0 u_i$, we have $(2nc_0/c_2)(r(O(p)) - r(A_0))$
is asymptotically $\chi^2(p(p - 1)/2)$ if $H_0$ is true.


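EXAMPLE. Suppose $p = 3$ and we wish to test if $A$ is a rotation about a
specified unit vector $\xi_0$. Let $H$ be the subgroup of all rotations around $\xi_0$. If $\xi_0$ is
the correct axis, we have $(2nc_0/c_2)(r(O(p)) - r(H))$ is asymptotically $\chi^2(2)$.

To calculate $r(H)$ we note that if $A(\theta)$ is right-hand rule rotation of $\theta$ radians
around $\xi_0$, then

$$A(\theta) = I + \sin\theta\, L + (1 - \cos\theta)L^2,$$

where

$$L = \begin{pmatrix} 0 & -\xi_3 & \xi_2\\ \xi_3 & 0 & -\xi_1\\ -\xi_2 & \xi_1 & 0\end{pmatrix}$$

and $\xi_0 = (\xi_1, \xi_2, \xi_3)^t$. Thus

$$r(H) = \max_{0 \le \theta < 2\pi} \frac{1}{n}\sum_i v_i^t A(\theta)u_i = a_0 + a_2 + (a_1^2 + a_2^2)^{1/2}$$

where $a_j = \frac{1}{n}\sum_i v_i^t L^j u_i$. The fitted angle $\hat\theta$ is specified by
$\sin\hat\theta = a_1/(a_1^2 + a_2^2)^{1/2}$ and $\cos\hat\theta = -a_2/(a_1^2 + a_2^2)^{1/2}$.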
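The closed form for $r(H)$ and the fitted angle can be evaluated directly. The following is a small sketch under the assumption that $L$ is the cross-product matrix of the unit axis and that $a_j = \frac{1}{n}\sum_i v_i^t L^j u_i$; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def cross_matrix(xi):
    """Skew-symmetric L with L @ x equal to the cross product of xi and x."""
    return np.array([[0.0, -xi[2], xi[1]],
                     [xi[2], 0.0, -xi[0]],
                     [-xi[1], xi[0], 0.0]])

def fit_fixed_axis(U, V, xi):
    """Return (r(H), theta_hat) for rotations about the unit vector xi.

    U and V are 3 x n arrays with columns u_i and v_i.
    """
    L = cross_matrix(xi)
    a0 = np.mean(np.sum(V * U, axis=0))             # (1/n) sum v_i^t u_i
    a1 = np.mean(np.sum(V * (L @ U), axis=0))       # (1/n) sum v_i^t L u_i
    a2 = np.mean(np.sum(V * (L @ L @ U), axis=0))   # (1/n) sum v_i^t L^2 u_i
    rho = np.hypot(a1, a2)
    theta_hat = np.arctan2(a1, -a2)                 # sin = a1/rho, cos = -a2/rho
    return a0 + a2 + rho, theta_hat

# Synthetic check: rotate random unit vectors by 0.3 radians about the z axis.
rng = np.random.default_rng(2)
U = rng.normal(size=(3, 15))
U /= np.linalg.norm(U, axis=0)
xi = np.array([0.0, 0.0, 1.0])
L = cross_matrix(xi)
A_true = np.eye(3) + np.sin(0.3) * L + (1 - np.cos(0.3)) * (L @ L)
V = A_true @ U
r_H, theta_hat = fit_fixed_axis(U, V, xi)
```

With exact data the fitted angle equals the true angle and $r(H)$ reaches 1.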




The critical region for both tests takes the form that the test statistic is too 
big, indicating, as in linear regression, that the improvement in the fit is better 
than can be attributed to overfitting of the model. 

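All these tests are still asymptotically valid if $c_0$ and $c_2$ are replaced by $\hat c_0$ and
$\hat c_2$ where

$$\hat c_0 = r(O(p)), \qquad \hat c_2 = \frac{1}{p - 1}\left[1 - \frac{1}{n}\sum_i \left(v_i^t \hat A_n(O(p)) u_i\right)^2\right].$$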


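For the convenience of the reader we note the following values of $c_0$, $c_1$, and $c_2$
for the Fisher distribution $d(\kappa)e^{\kappa v^t A u}$ where $d(\kappa) = \kappa/\sinh\kappa$ and $p = 3$:

$$c_0 = \coth\kappa - \frac{1}{\kappa},$$
$$c_1 = 1 - \frac{3\coth\kappa}{\kappa} + \frac{3}{\kappa^2} - c_0^2,$$
$$c_2 = \frac{\coth\kappa}{\kappa} - \frac{1}{\kappa^2}.$$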
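A quick numerical check of the Fisher constants may be useful. The sketch below assumes $p = 3$ and uses the equivalent forms $c_2 = c_0/\kappa$ and $c_1 = 1 - 3c_0/\kappa - c_0^2$, which follow from $E[(v^tu)^2] = 1 - 2c_0/\kappa$; the function name is hypothetical.

```python
import math

def fisher_constants(kappa):
    """Return (c0, c1, c2) for the Fisher distribution on the sphere in R^3."""
    c0 = 1.0 / math.tanh(kappa) - 1.0 / kappa
    c2 = c0 / kappa                       # equals coth(kappa)/kappa - 1/kappa^2
    c1 = 1.0 - 3.0 * c0 / kappa - c0 ** 2
    return c0, c1, c2

c0, c1, c2 = fisher_constants(10.0)
identity_residual = c0 ** 2 + c1 + 3 * c2 - 1.0   # Lemma 5 with p = 3
```

The residual of the identity $1 = c_0^2 + c_1 + pc_2$ should vanish up to rounding, and $c_1 + c_2$, the variance of $v^tu$, should be close to $1/\kappa^2$ for large $\kappa$.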


3. A numerical example. Geophysicists believe that the Gulf of Aden
formed as Arabia began to separate from Africa about 20 million years ago. Table
1 gives the latitudes and longitudes of fracture zone intersections with 3'S and
3'N magnetic anomalies, digitized from Figure 8 of Cochran (1981). Geophysical
theory indicates the Arabian and Somalian plates have been moving so that the
points $u_i$ and $v_i$ (for each $i$) were once coincident. The problem is to fit the
relative motion of the Arabian plate from the Somalian plate, thinking of the
Somalian plate as fixed in its present location.

The choice of the Somalian plate as fixed and the $u_i$ points as those on the
South intersections is arbitrary. If, however, the roles of the two plates were
reversed, the analysis would change as one would expect: For example, the fitted
rotation $\hat A$ would be replaced by $\hat A^t$.

When the points $u_i$ and $v_i$ are converted to Euclidean coordinates, the matrix
$U_n V_n^t/n$ is

$$\frac{U_n V_n^t}{n} = \begin{pmatrix}
0.3509 & 0.4547 & 0.1425\\
0.4454 & 0.5942 & 0.1867\\
0.1302 & 0.1738 & 0.0547
\end{pmatrix}$$


TABLE 1

      u_i (Somalia)              v_i (Arabia)
  Latitude   Longitude      Latitude   Longitude

   13.05      57.56          14.28      58.12
   13.34      57.07          14.54      57.67
   13.89      56.50          15.00      57.16
   14.19      55.97          15.33      56.51
   14.10      55.92          15.25      56.48
   14.21      55.38          15.37      55.93
   12.68      50.95          13.69      51.61
   11.97      47.66          12.78      48.11
   12.06      47.35          12.86      47.89
   11.63      46.80          12.44      46.39
   11.73      46.36          12.58      45.87

  n = 11 points


and $\hat A(SO(3))$ is

$$\hat A(SO(3)) = \begin{pmatrix}
0.9997 & -0.0175 & 0.0157\\
0.0180 & 0.9993 & -0.0341\\
-0.0151 & 0.0343 & 0.9993
\end{pmatrix}.$$

$\hat A$ represents a rotation of 2.38° around an axis through 25.31°N latitude and
24.29°E longitude.
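The quoted angle and axis can be read off the printed matrix. The sketch below assumes the coordinate convention $x = \cos(\text{lat})\cos(\text{lon})$, $y = \cos(\text{lat})\sin(\text{lon})$, $z = \sin(\text{lat})$, which is consistent with the later identification of 0°N, 90°E with $[0, 1, 0]^t$; the function name `axis_angle` is hypothetical, and the input is the four-decimal matrix above, so agreement is only up to rounding.

```python
import numpy as np

A_hat = np.array([[0.9997, -0.0175, 0.0157],
                  [0.0180, 0.9993, -0.0341],
                  [-0.0151, 0.0343, 0.9993]])

def axis_angle(A):
    """Rotation angle and axis latitude/longitude, all in degrees, of A in SO(3)."""
    # The skew part of A is sin(angle) times the cross-product matrix of the axis.
    w = np.array([A[2, 1] - A[1, 2], A[0, 2] - A[2, 0], A[1, 0] - A[0, 1]])
    s = np.linalg.norm(w) / 2.0           # sin(angle)
    c = (np.trace(A) - 1.0) / 2.0         # cos(angle)
    angle = np.degrees(np.arctan2(s, c))
    axis = w / np.linalg.norm(w)
    lat = np.degrees(np.arcsin(axis[2]))
    lon = np.degrees(np.arctan2(axis[1], axis[0]))
    return angle, lat, lon

angle, lat, lon = axis_angle(A_hat)
```

Because the rotation is small, the axis is sensitive to the rounding of the off-diagonal entries, so small discrepancies in the last digit of latitude and longitude are to be expected.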

Since the true rotation $A$ is known to be in $SO(3)$ and since $SO(3)$ is a
connected component of $O(3)$, Lemma 2 implies that

$$\operatorname{pr}[\hat A_n(SO(3)) = \hat A_n(O(3))] \to 1 \quad \text{as } n \to \infty.$$

Since $\det(U_n V_n^t/n)$ is positive, $\hat A(SO(3)) = \hat A(O(3))$ in this case.
For this data set

$$\hat c_0 = r(\hat A(SO(3))) = 1 - 0.5812 \times 10^{-6},$$
$$\hat c_2 = 0.5812 \times 10^{-6},$$

and

$$\hat c_1 + \hat c_2 = 0.3867 \times 10^{-12}.$$


McKenzie et al. (1970) found, by fitting the 500 fathom contours on each side
of the Gulf, that the pole of rotation of the Arabian plate relative to the
Somalian plate is located at 26.5°N and 21.5°E. If $H$ is the subgroup of rotations
around that axis, $\hat A(H)$ is a rotation of 2.20° and $r(H) = 1 - 0.6579 \times 10^{-6}$.
Then

$$\frac{2n\hat c_0}{\hat c_2}[r(SO(3)) - r(H)] = 2.902. \tag{1}$$

Comparing this to a $\chi^2(2)$ distribution, we see no contradiction between the data
of Table 1 and the McKenzie axis.
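The arithmetic behind statistic (1) can be reproduced from the quantities quoted in this section, read on the $10^{-6}$ scale. This is a sketch only: the variable names are ours, and the differences $1 - r(\cdot)$ are used directly to avoid cancellation near 1. For two degrees of freedom the upper tail of the $\chi^2$ distribution is $e^{-x/2}$.

```python
import math

n = 11
one_minus_r_G = 0.5812e-6     # 1 - r(SO(3)), which is also 1 - c0_hat
one_minus_r_H = 0.6579e-6     # 1 - r(H) for the McKenzie axis
c0_hat = 1.0 - one_minus_r_G
c2_hat = 0.5812e-6

# r(SO(3)) - r(H) = (1 - r(H)) - (1 - r(SO(3)))
statistic = 2 * n * c0_hat / c2_hat * (one_minus_r_H - one_minus_r_G)
p_value = math.exp(-statistic / 2.0)   # chi-square(2) upper tail probability
```

The resulting p-value is far from conventional rejection levels, in line with the conclusion that the data do not contradict the McKenzie axis.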
McKenzie also fits a rotation angle of 7.6°. If, following Cochran (1981), we
take 20 million years as the age of the rift and, following La Brecque et al. (1977),
5.37 million years before present as the time that the points $u_i$ and $v_i$ were
coincident, this prorates to an angle of 2.04° over the 5.37 million years. If $A_1$ is a
rotation of 2.04° around the axis 26.5°N and 21.5°E,

$$r(A_1) = 1 - 0.1691 \times 10^{-5}$$

and hence

$$\frac{2n\hat c_0}{\hat c_2}[r(SO(3)) - r(A_1)] = 42.02, \tag{2}$$

which needs to be compared to a $\chi^2(3)$ distribution to test $A = A_1$.

This spectacularly high level of $\chi^2$ should not cause any excitement. If the
angle of rotation in $A_1$ had been between 2.14 and 2.25° the null hypothesis
would have been accepted at an approximate 0.05 significance level. The imprecisions
in the dating used above make 2.04° in fact indistinguishable from angles in
that range.

With $r(G) = 1 - 0.5812 \times 10^{-6}$ and $\hat c_1 + \hat c_2 = 0.3867 \times 10^{-12}$ we get, using
Theorem 2(a), that with 95% confidence, $1 - c_0 = (0.5812 \pm 0.3675) \times 10^{-6}$. Assuming
a Fisher error distribution, $c_0(\kappa) = 1 - 1/\kappa + o(1/\kappa)$ and hence
$1.1 \times 10^6 < \kappa < 4.7 \times 10^6$ with an estimate $\hat\kappa = (0.5812 \times 10^{-6})^{-1} = 1.72 \times 10^6$.
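The interval for $1 - c_0$ and the induced bounds on $\kappa$ follow from Theorem 2(a) by simple arithmetic. The sketch below takes the estimates on the $10^{-6}$ and $10^{-12}$ scales, uses the normal 97.5% point 1.96, and inverts $1 - c_0 \approx 1/\kappa$; the variable names are hypothetical.

```python
import math

n = 11
one_minus_c0_hat = 0.5812e-6
var_hat = 0.3867e-12            # estimate of c1 + c2

# Theorem 2(a): r(G) is approximately normal with variance (c1 + c2)/n.
half_width = 1.96 * math.sqrt(var_hat / n)
lower = one_minus_c0_hat - half_width
upper = one_minus_c0_hat + half_width

# Fisher approximation c0(kappa) = 1 - 1/kappa, so kappa = 1/(1 - c0).
kappa_hat = 1.0 / one_minus_c0_hat
kappa_low, kappa_high = 1.0 / upper, 1.0 / lower
```

The bounds on $\kappa$ are quoted to two significant figures in the text, so the computed endpoints agree only to that precision.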

A computer simulation was run using IMSL generator GGUBS with 10000
runs, the given 11 points $u_i$, and a true rotation of 2.04° around 26.5°N, 21.5°E.
Three runs were made with a Fisher error distribution and $\kappa = 1.0 \times 10^6$,
$1.72 \times 10^6$, and $5.0 \times 10^6$. In each run, the test statistic (1) exceeded 2.902
approximately 30% of the time. This compares with a $\chi^2(2)$ distribution p-value
of 23%. The test statistic (2) exceeded 42.02 0.01% of the time.

For this problem, the author has found that the programming of the formulas
in this paper in single precision led to no significant figures in the computed
values of $\chi^2$. If instead the mathematically equivalent formulas

$$1 - r(G) = \frac{1}{2n}\sum_i |v_i - \hat A u_i|^2 = 1 - \hat c_0,$$
$$r(G) - r(H) = (1 - r(H)) - (1 - r(G)),$$
$$\hat c_2 = \frac{1}{2n}\sum_i |v_i - \hat A u_i|^2 - \frac{1}{8n}\sum_i |v_i - \hat A u_i|^4, \tag{3}$$
$$\hat c_1 + \hat c_2 = \frac{1}{4n}\sum_i |v_i - \hat A u_i|^4 - (1 - \hat c_0)^2$$

are used, the author has found that single and double precision programming
yield results agreeing in at least four significant figures. One can conclude that, at
least for this data set, the formulas (3) above work satisfactorily in single
precision.
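Formulas (3) translate directly into array code. The sketch below assumes $p = 3$, unit column vectors and NumPy; the function name `stable_estimates` and the synthetic data are illustrative, and the fitted matrix is taken to be the identity purely for the check.

```python
import numpy as np

def stable_estimates(U, V, A_hat):
    """Evaluate formulas (3); U, V are 3 x n arrays with columns u_i, v_i."""
    d2 = np.sum((V - A_hat @ U) ** 2, axis=0)     # |v_i - A_hat u_i|^2
    one_minus_c0 = d2.mean() / 2.0                # 1 - r(G) = 1 - c0_hat
    c2_hat = d2.mean() / 2.0 - (d2 ** 2).mean() / 8.0
    c1_plus_c2 = (d2 ** 2).mean() / 4.0 - one_minus_c0 ** 2
    return one_minus_c0, c2_hat, c1_plus_c2

# Synthetic data: unit vectors u_i and perturbed, renormalized v_i.
rng = np.random.default_rng(1)
U = rng.normal(size=(3, 50))
U /= np.linalg.norm(U, axis=0)
V = U + 0.05 * rng.normal(size=(3, 50))
V /= np.linalg.norm(V, axis=0)
one_minus_c0, c2_hat, c1_plus_c2 = stable_estimates(U, V, np.eye(3))
```

The point of working with $|v_i - \hat A u_i|^2$ is that these differences are small quantities computed directly, rather than obtained by subtracting two numbers that are both very close to 1.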

In this example, the $u_i$ suffer from a spherical regression analogue of multicollinearity:
They are very close to lying on a small arc of a great circle. This can be
detected using the matrix $\hat\Sigma = \frac{1}{n}\sum_i u_i u_i^t$. It is easily proven that the rank of $\hat\Sigma$
is the dimension of the smallest vector subspace of $R^p$ containing all the $u_i$. In
the instant case the eigenvalues of $\hat\Sigma$ are 0.99332, 0.00663, and 0.00005. From
Theorem 1, we see that multicollinearity causes large variances in $\hat A$ and hence
small changes in the data will cause unexpectedly large changes in $\hat A$. Furthermore,
when the estimated $\hat A$ is close to the identity, rotations which are in fact
quite close in $SO(3)$ might have seemingly disparate axes of rotation.
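The near-degeneracy can be seen from the eigenvalues of $\hat\Sigma$. The sketch below uses the four-decimal matrix $\hat\Sigma$ printed in the example of Section 4, with the (1,3) entry taken equal to the (3,1) entry by symmetry (an assumption on our part). Because the entries are rounded, only the trace and the largest eigenvalue can be expected to match the quoted values closely.

```python
import numpy as np

Sigma_hat = np.array([[0.3568, 0.4532, 0.1325],
                      [0.4532, 0.5924, 0.1733],
                      [0.1325, 0.1733, 0.0508]])

# eigvalsh is intended for symmetric matrices and returns ascending eigenvalues.
eigenvalues = np.linalg.eigvalsh(Sigma_hat)
largest = eigenvalues[-1]
```

Since every $u_i$ is a unit vector, the trace of $\hat\Sigma$, and hence the sum of its eigenvalues, equals 1; one eigenvalue close to 1 therefore forces the other two to be close to 0.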

The analysis assumes that the $u_i$ are known without error or at the very least
that the conditional distribution of $v_i$ given $u_i$ is symmetric around $Au_i$. A
preferable model would be: $u_i$ has a distribution of the form $g(u_i^t \xi_i)$; $v_i$ has a
distribution of the form $g(v_i^t A\xi_i)$; $u_i$ and $v_i$ are independent with $\xi_1, \ldots, \xi_n$, $A$
unknown. In this situation, the author has been able to prove an analogue of
Theorem 2 with a much more complicated asymptotic distribution. Alternatively
for $G = SO(3)$, and $H = \{I\}$ or $SO(2)$, the author has found more complicated
test statistics with asymptotic $\chi^2(3)$ or $\chi^2(2)$ distributions, respectively. When
the latter procedures are applied to the data of this example, with its very
concentrated error distributions, the values of the $\chi^2$ statistics agree with those
reported above to four significant figures. The author will report on these results
at a later date.

In this analysis the points $(u_i, v_i)$ are believed to have been simultaneously
coincident approximately 5.37 million years ago. In fact, geologists have dated a
sequence of anomalies going back in excess of 100 million years. The general
practice has been to choose a time interval (which may be shorter than the span
of the data), assume a constant axis and speed of rotation over the chosen
interval, and to fit them to all intersections from the chosen interval by the
process described in the introduction.

If we define $SSE(A) = \sum_i |v_i - Au_i|^2 = 2n - 2nr(A)$ we see that the distribution
of

$$\frac{SSE(A) - SSE(\hat A(SO(3)))}{SSE(\hat A(SO(3)))/2n}$$

is not asymptotically $\chi^2(3)$ as one might assume. It is rather asymptotically
$c_2/(c_0(1 - c_0))\,\chi^2(3)$. Nevertheless, for extremely concentrated error distributions,
such as those of the above example, $c_2/(c_0(1 - c_0))$ is, to very close
approximation, equal to 1.





4. Confidence regions for the orthogonal matrix A. Although Theorem
2(b) can be used to produce confidence regions for the unknown orthogonal
matrix $A$, the author believes that Theorem 1 is better suited for this purpose.

If $G$ is a closed subgroup of $O(p)$ and it is known a priori that $A \in G$, let
$\chi^2_{1-\alpha}$ be the appropriate critical point of the $\chi^2$ distribution with $\dim G$ degrees
of freedom. Let

$$\mathcal{C} = \left\{\phi(H) \,\Big|\, H \in L(G) \text{ and } -\operatorname{tr}(H^2\Sigma) < \frac{c_2}{nc_0^2}\chi^2_{1-\alpha}\right\}.$$

Since $\phi(-H) = \phi(H)^t$, it is easy to see that the required confidence region is
$\hat A_n(G)\mathcal{C}$.


Alternatively, we might wish to express our confidence region in the form of
$\hat A_n(G)$ followed by a small perturbation. In this case, since $\phi(AHA^t) = A\phi(H)A^t$,
the confidence region is $\mathcal{C}'\hat A_n(G)$ where

$$\mathcal{C}' = \left\{\phi(H) \,\Big|\, H \in L(G) \text{ and } -\operatorname{tr}(H^2\Sigma') < \frac{c_2}{nc_0^2}\chi^2_{1-\alpha}\right\}$$

and $\Sigma' = \hat A_n(G)\Sigma\hat A_n(G)^t$.
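The following alternative definition of the exponential map $\phi$ might be
helpful. If $H$ is skew-symmetric, an orthogonal matrix $O$ can be found so that

$$O^tHO = \text{block diagonal}\left[\begin{pmatrix} 0 & -\theta_1\\ \theta_1 & 0\end{pmatrix}, \ldots, \begin{pmatrix} 0 & -\theta_k\\ \theta_k & 0\end{pmatrix}\right].$$

Here $k = [p/2]$ and an additional diagonal entry of 0 needs to be added if $p$ is
odd. Then $\phi(H) = \phi(OO^tHOO^t) = O\phi(O^tHO)O^t = O\Lambda O^t$ where

$$\Lambda = \text{block diagonal}\left[\begin{pmatrix} \cos\theta_1 & -\sin\theta_1\\ \sin\theta_1 & \cos\theta_1\end{pmatrix}, \ldots, \begin{pmatrix} \cos\theta_k & -\sin\theta_k\\ \sin\theta_k & \cos\theta_k\end{pmatrix}\right]$$

and an additional diagonal entry of 1 needs to be added if $p$ is odd.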

Asymptotic confidence regions of minimum volume will be achieved if the $u_i$
can be chosen so that $\Sigma = (1/p)I$. This can be done by using a uniform random
point generator on $S^p$ or, if $n = pr$ for some $r$, by replicating $r$ times any
orthogonal basis of Euclidean $p$ space. In that case given two matrices $A$ and $B$
of $G$ define a distance function on $G$ by:

$d(A, B) = \theta_1^2 + \cdots + \theta_k^2$ if $\det A^tB = 1$ and $A^tB$ has eigenvalues
$e^{i\theta_1}, e^{-i\theta_1}, \ldots, e^{i\theta_k}, e^{-i\theta_k}$ (together with $+1$ if $p$ is odd),
where $k = [p/2]$ and $-\pi < \theta_i \le \pi$,

$d(A, B) = \infty$ if $\det A^tB = -1$.

It follows from Theorem 1 and the above alternate description of $\phi$ that
$(2nc_0^2/pc_2)\,d(A_0, \hat A_n(G))$ is asymptotically $\chi^2(\dim G)$.
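When $p = 3$ and $G = SO(3)$, the general element of $L(SO(3))$ is of the form

$$H = \begin{pmatrix} 0 & -t_3 & t_2\\ t_3 & 0 & -t_1\\ -t_2 & t_1 & 0\end{pmatrix} \tag{4}$$

and it can be shown that $\phi(H)$ is right-hand rule rotation of $\sqrt{t_1^2 + t_2^2 + t_3^2}$
radians around the axis $(t_1^2 + t_2^2 + t_3^2)^{-1/2}[t_1\ t_2\ t_3]^t$.

If we identify $L(SO(3))$ with $R^3$ by identifying an $H$ in the form (4) above
with $[t_1, t_2, t_3]^t$, we get the following equivalent description $\psi: R^3 \to SO(3)$ of
the exponential map: If $x \in R^3$, let $\theta = |x|$ and $\xi = x/|x|$; then $\psi(x)$ is right-hand
rule rotation of $\theta$ radians around the axis $\xi$. In terms of $\psi$, the regions $\mathcal{C}$ and $\mathcal{C}'$
above become

$$\mathcal{C} = \left\{\psi(x) \,\Big|\, x^t(I - \Sigma)x < \frac{c_2}{nc_0^2}\chi^2_{1-\alpha}\right\},$$
$$\mathcal{C}' = \left\{\psi(x) \,\Big|\, x^t(I - \hat A\Sigma\hat A^t)x < \frac{c_2}{nc_0^2}\chi^2_{1-\alpha}\right\},$$

where $\hat A = \hat A_n(SO(3))$. As before, our confidence regions become $\hat A\mathcal{C}$ or $\mathcal{C}'\hat A$.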


EXAMPLE. We continue with the example of the previous section. We have
$\hat c_2 = 0.5812 \times 10^{-6}$, $\hat c_0 = 1.0000$, and estimating $\Sigma$ by

$$\hat\Sigma = \frac{1}{n}\sum_i u_i u_i^t = \begin{pmatrix}
0.3568 & 0.4532 & 0.1325\\
0.4532 & 0.5924 & 0.1733\\
0.1325 & 0.1733 & 0.0508
\end{pmatrix}$$

we have

$$\hat A\hat\Sigma\hat A^t = \begin{pmatrix}
0.3451 & 0.4470 & 0.1401\\
0.4470 & 0.5961 & 0.1872\\
0.1401 & 0.1872 & 0.0589
\end{pmatrix}.$$

Using $\chi^2 = 7.81$, the 95% critical point of a $\chi^2(3)$ distribution, we have that $\mathcal{C}'$
consists of all $\psi([x_1, x_2, x_3]^t)$ satisfying

$$0.345x_1^2 + 0.596x_2^2 + 0.0589x_3^2 + 0.894x_1x_2 + 0.280x_1x_3 + 0.374x_2x_3 < 0.413 \times 10^{-6}$$

and the 95% confidence region for $A$ is any rotation of the form $\hat A$ (rotation of
2.38° around 25.31°N, 24.29°E) followed by any rotation in $\mathcal{C}'$. For example we
could follow $\hat A$ by a rotation around 0°N latitude, 90°E longitude ($[0, 1, 0]^t$) of at
most $((0.413 \times 10^{-6})/0.596)^{1/2} = 0.832 \times 10^{-3}$ radians $= 0.048°$.

The eigenvectors of $I - \hat A\hat\Sigma\hat A^t$ are

$$[0.5857, 0.7733, 0.2428]^t, \qquad [0.8099, -0.5465, -0.2131]^t,$$

and

$$[0.0322, -0.3214, 0.9464]^t$$

with corresponding eigenvalues 0.00668, 0.99337, and 0.99995. Thus the largest
rotation in $\mathcal{C}'$ is $((0.413 \times 10^{-6})/(0.668 \times 10^{-2}))^{1/2}$ radians $= 0.451°$ around an axis
14.05°N, 52.86°E ($= [0.5857, 0.7733, 0.2428]^t$).
Every rotation in $\mathcal{C}'\hat A$ satisfies the inequalities

23.60°N < axis latitude < 27.40°N,
17.52°E < axis longitude < 29.05°E,
2.00° < rotation angle < 2.78°.

Hence, these three inequalities are asymptotic at least 95% simultaneous confidence
intervals.
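The size of the largest perturbation in the region can be recomputed from the quoted constants. The sketch below takes the bound $\hat c_2\chi^2_{1-\alpha}/(n\hat c_0^2)$ with $\chi^2_{0.95}(3) = 7.81$, divides by the smallest eigenvalue 0.00668 of $I - \hat A\hat\Sigma\hat A^t$, and converts the corresponding eigenvector to latitude and longitude under the convention $x = \cos(\text{lat})\cos(\text{lon})$, $y = \cos(\text{lat})\sin(\text{lon})$, $z = \sin(\text{lat})$. The variable names are ours.

```python
import math

n, chi2_crit = 11, 7.81
c0_hat, c2_hat = 1.0000, 0.5812e-6
bound = c2_hat * chi2_crit / (n * c0_hat ** 2)

smallest_eigenvalue = 0.00668             # of I - A_hat Sigma_hat A_hat^t
axis = (0.5857, 0.7733, 0.2428)           # corresponding unit eigenvector

# On this axis the quadratic form is smallest_eigenvalue * angle^2 < bound.
max_angle_deg = math.degrees(math.sqrt(bound / smallest_eigenvalue))
axis_lat = math.degrees(math.asin(axis[2]))
axis_lon = math.degrees(math.atan2(axis[1], axis[0]))
```

The direction of least precision is the one along which the $u_i$ cluster, which is the multicollinearity effect noted in Section 3.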


The confidence region $\mathcal{C}'\hat A$ was reexpressed in the form: axis $\in \mathcal{A}$, $f(\text{axis}) <$
rotation angle $< g(\text{axis})$ where $\mathcal{A}$ is a subset of $S^3$, and $f$ and $g$ are real valued
functions on $\mathcal{A}$. Points on the graph of $f$ and $g$ (the lower and upper surfaces,
respectively) were calculated and contour maps of the upper and lower surfaces
drawn using the SURFACE II package developed by the Kansas Geological Survey.
These maps appear in Figure 1. Thus, for example, all rotations around the axis
24°E, 26°N with a rotation angle between 2.3 and 2.4° lie in the confidence
region.

[FIG. 1. Contour maps of the lower surface and the upper surface, plotted against axis longitude.]


Acknowledgments. The author wishes to thank Professor Colin Ferguson
of the Department of Geology, Birkbeck College, University of London, and Mr.
Steven Jones of the Kansas Geological Survey for their extensive help with the
geophysical aspects of this paper. He would also like to thank the referee for his
enthusiasm and several helpful remarks.

The author wishes to thank the State of Kansas General Research Fund for 
funding the preliminary work on this project. He especially wishes to thank the 
Statistics faculty at Simon Fraser University for their generosity in sharing their 
National Science and Engineering Research Council (Canada) grants during the 
term of the majority of the work. 


REFERENCES 


BILLINGSLEY, P. (1979). Probability and Measure. Wiley, New York.

CHASE, C. G. (1972). The N plate problem of plate tectonics. Geophys. J. Roy. Astron. Soc. 29
117-122.

COCHRAN, J. (1981). The Gulf of Aden: Structure and evolution of a young ocean basin and
continental margin. J. Geophys. Res. 86B 263-287.

ENGEBRETSON, D., COX, A. and GORDON, R. (1984). Relative motions between oceanic plates of the
Pacific basin. J. Geophys. Res. 89B 10291-10310.

GOULD, A. L. (1969). A regression technique for angular variates. Biometrics 25 683-700.

LABRECQUE, J. L., KENT, D. V. and CANDE, S. C. (1977). Revised magnetic polarity time scale for
Late Cretaceous and Cenozoic time. Geology 5 330-335.

LE PICHON, X. (1968). Sea-floor spreading and continental drift. J. Geophys. Res. 73 3661-3697.

LE PICHON, X., FRANCHETEAU, J. and BONNIN, J. (1973). Plate Tectonics. Elsevier, New York.

MACKENZIE, J. K. (1957). The estimation of an orientation relationship. Acta Cryst. 10 61-62.

MCKENZIE, D. P., DAVIES, D. and MOLNAR, P. (1970). Plate tectonics of the Red Sea and East
Africa. Nature 226 243-248.

RIVEST, L. P. (1984). The bivariate Fisher-von Mises distribution. Unpublished manuscript.

SPIVAK, M. (1979). A Comprehensive Introduction to Differential Geometry 1, 2nd ed. Publish or
Perish, Boston.

STEPHENS, M. A. (1979). Vector correlation. Biometrika 66 41-48.

DEPARTMENT OF MATHEMATICS 

UNIVERSITY OF KANSAS 

LAWRENCE, KANSAS 66045 


The Annals of Statistics
1986, Vol. 14, No. 3, 925-933


ASYMPTOTIC CONDITIONAL INFERENCE FOR 
THE OFFSPRING MEAN OF A SUPERCRITICAL 
GALTON-WATSON PROCESS 


By T. J. SWEETING 
University of Surrey 


Consider a supercritical Galton-Watson process $(Z_n)$ with offspring
distribution a member of the power series family, and having unknown mean
$\theta$. The conditional asymptotic normality of the suitably normalized maximum
likelihood estimator of $\theta$ given the conditional information is established. The
conditional information here is proportional to the total number of ancestors
$V_n$, and it is also seen that this statistic is asymptotically ancillary for $\theta$ in a
local sense. The proofs are via a detailed analysis of the joint characteristic
function of $(Z_n, V_n)$ and the derivation serves to highlight the difficulties
involved in establishing such conditional results generally.


1. Introduction. This paper is concerned with asymptotic conditional inference
for the offspring mean $\theta$ in a supercritical Galton-Watson branching
process. The branching process with unknown mean is an instance of a nonergodic
statistical model, where the appropriately normed sample Fisher information,
$W_n(\theta)$ say, converges to a nondegenerate random variable $W$, rather than
to a constant; see for example Basawa and Scott (1983). Under suitable regularity
conditions it can be shown that, for such a model, $(X_n(\theta), W_n(\theta))$ converges in
joint distribution to $(Z, W)$ where $X_n(\theta)$ is the appropriate randomly normed
maximum likelihood estimator (m.l.e.) $\hat\theta_n$ and $Z$ is a normal random variable
independent of $W$ [Sweeting (1980) and Basawa and Scott (1983)]. This result
suggests that (a) some statistic $V_n$ related to $W_n(\theta)$ might be regarded as
asymptotically ancillary for $\theta$, since the asymptotic distribution of $W_n(\theta)$ is
continuous in $\theta$, and hence effectively constant over the main range of variation
of the distribution of $\hat\theta_n$, and (b) the conditional sampling distribution of $X_n(\theta)$
given $V_n$, whose use would be dictated by the conditionality principle, would still
be asymptotically normal.

If this is the case, then approximate confidence intervals for $\theta$ based on this
conditional distribution will coincide with approximate Bayesian h.p.d. intervals.
Similar remarks have been made in Sweeting (1978, 1980) and amplified in Feigin
and Reiser (1979). As noted in Sweeting (1982, 1983), however, a rigorous
verification of (a) and (b) would appear to be far from easy in general, although
one would expect such results to be true for many cases of interest.

It should be noted that the approach considered here is not the same as that
considered by Keiding (1974) for the birth process, and later more generally by
Basawa and Brockwell (1984). They condition on the limit random variable $W$,

Received April 1985; revised July 1985. 

AMS 1980 subject classifications. Primary 60J80, 62F12.

Key words and phrases. Asymptotic conditional inference, nonergodic models, supercritical 
branching process, maximum likelihood estimator, asymptotic ancillarity. 




and then treat the unobserved value $w$ of $W$ as a nuisance parameter, which is
then estimated, via $V_n$ for instance. This approach has the attraction of reducing
a nonergodic model to an ergodic one; the final result still does not tell us
whether asymptotic normality holds conditional on $V_n$, however.

In this paper we obtain the conditional limit theorem in the case of a
supercritical Galton-Watson process with unknown offspring mean $\theta$, when the
offspring distribution is a member of the power series family of distributions. The
"related" conditioning statistic here is the total number of ancestors $V_n$, which is
proportional to the conditional information. Moreover, the results imply that $V_n$
is asymptotically ancillary for $\theta$ in a local sense. The derivation, which follows
the detailed analysis in Dubuc and Seneta (1976), highlights the difficulties
involved in establishing such conditional limit theorems more generally. The
problem is rather more difficult than, and in fact includes, the problem of
establishing a local limit result for $V_n$.

There have been a number of articles recently concerning approximate ancil- 
larity and conditionality, mainly pertaining to independent samples. Much of 
this work has developed from the papers by Efron and Hinkley (1978) and Cox 
(1980); see also Hinkley (1980), Barndorff-Nielsen (1980), Ryall (1981), Amari 
(1982), and McCullagh (1984). In particular, the construction of approximate 
ancillary statistics based on the observed information has received much atten- 
tion. 


2. Preliminaries and statement of results. Let $(Z_0 = 1, Z_1, \ldots, Z_n)$ be a
sample of successive generation sizes from a supercritical Galton-Watson process.
We assume that the nondegenerate offspring distribution $(p_j)$ is a member of the
power series family, $p_j = f_j\lambda^j\{F(\lambda)\}^{-1}$ where $\lambda > 0$, $f_j \ge 0$, and $F(\lambda) = \sum_j f_j\lambda^j$.
We suppose that $f_0 = 0$, so that $\theta = E(Z_1) > 1$, and assume $\sigma^2 = \sigma^2(\theta) =
\operatorname{Var}(Z_1) < \infty$. The maximum likelihood estimator $\hat\theta_n$ of $\theta$ is given by

$$\hat\theta_n = 1 + V_n^{-1}(Z_n - 1),$$

where $V_n = \sum_{j=0}^{n-1} Z_j$ is the total number of ancestors (Heyde, 1975). Moreover, the
conditional information is $\sigma^{-2}V_n$ (Heyde, 1975), and

$$T_n(\theta) = (\theta^n - 1)^{-1}(\theta - 1)V_n \to W \quad \text{a.s.}$$
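The estimator is easy to compute from a simulated path. The sketch below assumes, purely for illustration, a geometric offspring law on $\{1, 2, \ldots\}$ with success probability $p$ (a power series family member with $f_0 = 0$ and mean $1/p$); the function name, the seed and the number of generations are hypothetical choices.

```python
import random

def simulate_generations(n, p=0.5, seed=0):
    """Z_0 = 1, ..., Z_n with offspring counts geometric on {1, 2, ...}."""
    rng = random.Random(seed)
    sizes = [1]
    for _ in range(n):
        children = 0
        for _ in range(sizes[-1]):
            k = 1
            while rng.random() > p:       # keep counting until first success
                k += 1
            children += k
        sizes.append(children)
    return sizes

Z = simulate_generations(12)
V_n = sum(Z[:-1])                         # total number of ancestors
theta_hat = 1 + (Z[-1] - 1) / V_n         # m.l.e. of the offspring mean
```

Since $Z_0 = 1$, this expression equals the ratio of total offspring $Z_1 + \cdots + Z_n$ to total ancestors $Z_0 + \cdots + Z_{n-1}$.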
It is shown by Basawa and Scott (1976) that

$$(X_n(\theta), T_n(\theta)) \to (Z, W) \tag{1}$$

in joint distribution, where $X_n(\theta) = V_n^{1/2}(\hat\theta_n - \theta)$ and $Z \sim N(0, \sigma^2)$ independently
of $W$. Furthermore, the convergence in (1) is locally uniform in $\theta > 1$
[Sweeting (1978, 1980)].

As discussed in the previous section, if $V_n$ could be regarded as "asymptotically
ancillary" for $\theta$, then the conditionality principle would dictate basing
inferences about $\theta$ on the conditional distribution of $\hat\theta_n$ given $V_n$. Moreover, $V_n$ is
a prime candidate for such a conditionality resolution as it directly affects the
precision of $\hat\theta_n$; see for example Efron and Hinkley (1978) and the related
discussion. A statistic $V_n$ is often said to be approximately ancillary if its
distribution is approximately free from $\theta$. In practice, this usually comes down to
checking that the density of $V_n$ is approximately free from $\theta$, via an Edgeworth
expansion, for example, in the independent case. The definition of asymptotic
ancillarity however should really be based on the asymptotic behaviour of the
density or, as in this case, the probability mass function (p.m.f.) of $V_n$, since it is
the lack of information contained in the observed value of $V_n$ which is relevant.
One possible formulation of asymptotic ancillarity in a local sense of $V_n$ here is


$$(2) \qquad \lim_{n \to \infty} \frac{P(V_n = v_n \mid \theta_n)}{P(V_n = v_n \mid \theta)} = 1$$


whenever $|\theta_n - \theta| \le A\theta^{-n/2}$ and $v_n/c_n \to w$ for any $A > 0$, $w > 0$, where $c_n =
c_n(\theta) = (\theta - 1)^{-1}(\theta^n - 1)$. [It should be noted that in Cox (1980) the phrase
"local ancillarity" refers to the behaviour of the distribution in a neighbourhood
of the true value of $\theta$.] A very similar criterion to (2) was also used in Section 5,
Chapter 4 of Basawa and Scott (1983) while investigating the efficiency of
conditional tests in mixed exponential families.

Borrowing the notation in Dubuc and Seneta (1976), we shall say that the 
process is of type (L,r) if L is the greatest integer for which the offspring 
distribution is defined on a lattice {kL + r: k = 0,1,...}. We prove the following 
result. 


THEOREM. (i) $V_n$ is asymptotically ancillary for $\theta$ in the sense of (2), and (ii)
if $(v_n)$ is a sequence of integers such that $v_n \equiv \sum_{j=0}^{n-1} r^j \pmod{L}$ and $v_n/c_n \to w > 0$
then

$$v_n^{1/2}(\hat\theta_n - \theta) \mid V_n = v_n \to Z$$

in distribution, where $Z \sim N(0, \sigma^2)$, uniformly in compact intervals of $\theta \in (1, \infty)$.


The proof is via a detailed analysis of the joint characteristic function of
$(Z_n, V_n)$, along the lines of Dubuc and Seneta (1976). As a by-product, the local
limit theorem for $V_n$ is established, which implies the asymptotic ancillarity
of $V_n$.

Write $S_n = (Z_n - 1) - (\theta - 1)V_n$ and let $U_n = c_n^{-1/2} S_n = T_n^{1/2} X_n$ (suppressing
the parameter $\theta$). Define the characteristic functions $\phi_n(x, \eta) = E(e^{i(xS_n + \eta V_n)})$,
$\psi_n(\zeta, \xi) = E(e^{i(\zeta U_n + \xi T_n)}) = \phi_n(c_n^{-1/2}\zeta, c_n^{-1}\xi)$. From (1) and the continuous
mapping theorem we have

$$(3) \qquad \psi_n(\zeta, \xi) \to \psi(\zeta, \xi) = E(e^{i(\zeta U + \xi W)})$$

uniformly in finite rectangles, where $U = W^{1/2}Z$. Furthermore, $\psi(\zeta, \xi) =
E\{e^{i\xi W} E(e^{i\zeta W^{1/2} Z} \mid W)\} = g(\tfrac{1}{2}(\sigma\zeta)^2 - i\xi)$ where $g(s) = E(e^{-sW})$. Let $\psi_n^v(\zeta) =
E(e^{i\zeta U_n} \mid V_n = v)$ be the conditional characteristic function of $U_n$ given $V_n = v$.
The relationship between $\psi_n^v(\zeta)$ and the joint characteristic function is given by

$$\psi_n^v(\zeta)\, p_n^0(v) = p_n^\zeta(v),$$

where

$$p_n^\zeta(v) = (L/2\pi) \int_{-\pi/L}^{\pi/L} e^{-iv\eta}\, \phi_n(c_n^{-1/2}\zeta, \eta)\, d\eta.$$

See for example Bartlett (1938). Clearly $p_n^0(v)$ is the p.m.f. of $V_n$. The proof
consists essentially of showing that $L^{-1} c_n p_n^\zeta(v_n) \to p^\zeta(w)$ for an appropriate
sequence $(v_n)$ with $c_n^{-1} v_n \to w > 0$, where $p^\zeta(w) = (1/2\pi)\int_{-\infty}^{\infty} e^{-iw\xi}\, \psi(\zeta, \xi)\, d\xi$,
from which it will follow that $\psi_n^{v_n}(\zeta) \to \psi^w(\zeta) = E(e^{i\zeta U} \mid W = w)$.


3. Lemmas. We need a number of results concerning the joint characteristic
function $\phi_n(x, \eta)$ of $(S_n, V_n)$. Let $H_n(s, t) = E(s^{Z_n} t^{V_n})$, $n \ge 1$, be the joint
probability generating function of $(Z_n, V_n)$. These generating functions are
recursively related by

$$(4) \qquad H_n(s, t) = t f(H_{n-1}(s, t)), \quad n \ge 1, \qquad H_0(s, t) = s,$$

where $f(s) = E(s^{Z_1})$ [Jagers (1975)]. Define $K(z, \theta) = e^{i\theta} f(z)$ where $\theta$ is real
and $z$ complex with $|z| \le 1$. Let $K_n(z, \theta)$ be the $n$th functional iterate of $K(z, \theta)$
in the first argument: that is, $K_n(z, \theta) = K(K_{n-1}(z, \theta), \theta)$, $n \ge 1$, where we
have set $K_0(z, \theta) = z$. It follows that $H_n(s, e^{i\eta}) = K_n(s, \eta)$, and so $\phi_n(x, \eta) =
e^{-ix} H_n(e^{ix}, e^{i(\eta - (\theta - 1)x)}) = e^{-ix} K_n(e^{ix}, \eta - (\theta - 1)x)$. We shall need the following
estimates.


LEMMA 1. There exists $\rho$ with $p_1 < \rho < 1$ such that for all $p \ge 1$, $|z|$,
$|z'| \le R < 1$ and all $\eta, \eta'$ one has

$$(5) \qquad |K_p(z, \eta)| \le A_R \rho^p$$

and

$$(6) \qquad |K_p(z, \eta) - K_p(z', \eta')| \le B_R \rho^p (|z - z'| + |\eta - \eta'|).$$


PROOF. Choose $h$ sufficiently small for $\rho = f'(h) < 1$. If $0 < R < 1$ then
$f_p(R) \downarrow 0$ where $f_p(s) = H_p(s, 1)$, and so there exists an integer $N$ such that
$f_p(R) \le h$ for all $p \ge N$. Then if $|z| \le R < 1$ and $p > N$,

$$|K_p(z, \eta)| = |K(K_{p-1}(z, \eta), \eta)| \le |f'(z_1)|\, |K_{p-1}(z, \eta)|,$$

where $|z_1| \le |K_{p-1}(z, \eta)| \le h$, since $K(0, \eta) = 0$. Thus $|K_p(z, \eta)| \le
\rho^{p-N} |K_N(z, \eta)|$ and (5) follows.
For (6), write $\delta_p = |K_p(z, \eta) - K_p(z', \eta')|$; then if $p > N$

$$\delta_p \le |f(K_{p-1}(z, \eta)) - f(K_{p-1}(z', \eta'))| + |e^{i(\eta - \eta')} - 1|\, |K_p(z', \eta')|
\le \rho\, \delta_{p-1} + |\eta - \eta'| A_R \rho^p$$

from (5). Iterating, one arrives at $\delta_p \le \rho^{p-N}(\delta_N + A_R (1 - \rho)^{-1} |\eta - \eta'|)$. Finally,
it is readily verified that $|(\partial/\partial z) K_N(z, \eta)| \le C_1$ and $|(\partial/\partial \eta) K_N(z, \eta)| \le C_2$ for
all $|z| \le 1$ and all $\eta$, so that $\delta_N \le C_1 |z - z'| + C_2 |\eta - \eta'|$ and (6) follows. □
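
The recursion (4) and the resulting expression for the characteristic function of $(S_n, V_n)$ can be explored numerically. The sketch below is an illustration only and is not taken from the paper: it iterates $H_n(s, t) = t f(H_{n-1}(s, t))$ from $H_0(s, t) = s$ and evaluates $e^{-ix} H_n(e^{ix}, e^{i(\eta - (\theta - 1)x)})$. The helper names and the choice of a geometric offspring law on $\{1, 2, \ldots\}$ (a power series family member with $f_0 = 0$) are assumptions made for the example.

```python
import cmath


def joint_pgf(n, s, t, f):
    """H_n(s, t) from the recursion H_n = t * f(H_{n-1}), H_0 = s."""
    value = s
    for _ in range(n):
        value = t * f(value)
    return value


def phi_n(n, x, eta, f, theta):
    """Characteristic function of (S_n, V_n):
    exp(-ix) * H_n(exp(ix), exp(i(eta - (theta - 1) x)))."""
    s = cmath.exp(1j * x)
    t = cmath.exp(1j * (eta - (theta - 1.0) * x))
    return cmath.exp(-1j * x) * joint_pgf(n, s, t, f)


def geometric_pgf(q):
    """Offspring p.g.f. for p_j = (1 - q) q^(j - 1), j = 1, 2, ...
    (illustrative choice; the offspring mean is 1 / (1 - q))."""
    return lambda z: (1.0 - q) * z / (1.0 - q * z)
```

A convenient sanity check is the degenerate doubling process $f(s) = s^2$, for which $Z_n = 2^n$ and $V_n = 2^n - 1$, so that $H_n(s, t) = s^{2^n} t^{2^n - 1}$ and $S_n = 0$ when $\theta = 2$.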


Lemmas 2-7 mirror results in Section 2 of Dubuc and Seneta (1976) for the 
characteristic function of Z,. 


LEMMA 2. For all $\varepsilon > 0$

$$\sup\{|\phi_n(x, \eta)| : n \ge 1,\ |x| \le \pi/L,\ \varepsilon/c_n \le |\eta| \le \pi/L\} < 1.$$




Proor. From (3), ,(§, £) > g((of)*® — ig) uniformly in finite intervals of R°. 
Clearly |g((of)? — i£)| < g((of)”), and |g(—ig)| < 1 if € + 0, as in the proof of 
Lemma 1 in Dubuc and Seneta (1976). Since g(s) is a decreasing function of 
s > 0 it follows that sup{|y(f, €)|: § € R} < 1 for each ¿$ + 0. Thus there exists 
r € (0,1) and N > 1 such that |¥,(0,6)| sr if n>N and e s< |$] < (1 + @e. 
Then if e/c, < |n| < €/c,_, we have |¢,(x,7)| <r for k > N since c,/c,_, < 
1 + 0, and hence for this range of values 


len(x,7)|=|K,(e%, n- (8 - 1)x)|<|K,(e% n- (0 - 1)x)| 


=|¢,(x,1)| <r. 


Consider finally the region e/c,_; < |n] < 7/L. When 0 < ô < |y| < 7/L there 
exists S(5) < 1 such that |f(e'%)| < S(6). Thus |¢,(x, m)l < lox MI = 
|f(e™*)| < S(8) < 1 provided 0 < ô < |x| < 7/L. When x =0 we have 
\,,(0, DI < I0, )| = | Fe] < S(e/en_,) < 1 and it follows by continuity that 
sup{|¢,(x, n)l: Ix] $ 7/L, &/ey_, < |n| < */L} < 1 as required. 0 


In a similar way to Dubuc and Seneta (1976) we define the sequence of 
intervals 
J, = {ni aL ey! < jns Loch} ak > 1. 
Lemma 3. For alln > k, |x| < 7/L, n € J, k> 1 there exists a constant A 
such that 
ln(x n) | < Ap™*. 


Proor. We have 
le,(x,9)|=|Ka(e™, 1 — (0 - 1)x)| 


=|K,_,(e%4(x,0),0 — (8 ~ 1)x)| 


and by Lemma 2 there exists R < 1 such that |¢,(x, 7)| < R for all |x| < 7/L, 
n € d,, k > 1. The result now follows immediately from inequality (5) of Lemma 
10 


Dubuc and Seneta (1976) show that $g(s)$ is a lower bound for $f_n(e^{-s/c_n})$ for all
$s \in (0, 1)$, where $f_n(s) = H_n(s, 1)$. We require a similar bound for $h_n(e^{-t/c_n})$,
where $h_n(t) = H_n(1, t)$. Establishing such a bound requires a little more work,
but one does arise as a consequence of the bound for $f_n(e^{-s/c_n})$.


LEMMA 4. There exist numbers $a > 0$, $\tau < 1$ such that for all $t \in (0, \tau)$ and
all $n \ge 1$, $h_n(e^{-t/c_n}) \ge g(at)$.


Proor. Write P(s) = —log f(e~*). Then P(s) is an increasing concave func- 
tion, P(0) = 0, lim,..,,P(s) = œ, and P (s) = —log f,(e~*) is the nth func- 
tional iterate of P(s). Write Q,(t) = —logh,(e~‘); then from (4), @,(f) = 




P(Q,,.(£)) +£ n 21, from which it follows that Q,_,(f) + £ < Q(t), since 
Ps) > 1 for all s > 0. Let £, > 0 and let A = P(s)) > 1. Then if u + t/A < 8, 
a first-order Taylor expansion gives P(u) + t< P(u + t/A). Now choose £ so 
small that Q(t) < sọ. It then follows that Q,(t) = P(Q„- (£) + t < P(Q,-4 
(t) + t/A) since Q,_,(t) + t/A < Q,_ (4) + t < Q(t) < 8. Iterating, one finds 
that 


Qn (t) < P(E) + EATI + +++ 447") < Pat), 


where a = X/(A — 1). Finally, since Q,(t¢/c,,) > —log g(t) one can choose t so 
small that Q,(t/c,) <8 for all n>1 and 0<t<t, giving Q,(t/c,) < 
Pat/c,). Therefore h,(e~'/) > f(e) > glat) if 0O<t<a™', from 
Lemma 5 in Dubuc and Seneta (1976), and the result follows on choosing 
7T=min(a™',¢,). O 


LEMMA 5. There exists a function V* defined on (0, 0) such that (a) V* is 
slowly varying as x > 0 + , (b) V* is bounded on every interval (e, œ), e > 0, 
and (c) for all |x| < n/L, n,n E€ J, with |n — y'|< 8 andalln=k 


lonx n) — balx n) | < côp” *V*(c,8). 
Proor. We have 
lon(x» n) — alx 9')|=|K,(e%, 0 — (8 - 1)x) - K,(e%, 9’ — (8 - 1)x)| 
=|K,-s(2, 8) — Kn-a(2’, B’), 


where z = e'%$,(x,7), 8 =n — (0 — 1)x, etc. From Lemma 2 there exists R < 1 
such that |¢,(x, 7)| < R for all n > k, n E J,, and Lemma 1 now gives 


(7) le.(x.0) — ox, n) | < Bro” *(lo.(x, n) — ba(x, n) l+ 8). 
But 
lolx 0) — a(x, 1’) | =| E (ex7 Y (e= — 1))| 
< Ejer — 1| < C(1 — h,(e7°)) 


as in the proof of Lemma 6 in Dubuc and Seneta (1976). But if c,6 < r then 
Lemma 4 of that paper and Lemma 4 here give 


1—h,(e~*) < acô V( cð), 


where V(s) = (1 — g(s))/s is slowly varying as 8 > 0 +. Thus, |¢,(x, 7) — 
,(x, 1')| < c,5V,(c,5) where V(x) = CaV(x), x < t and V(x) = C, x > r. With 
a suitable choice of V*(x) the result follows from (7). O 


LEMMA 6. Let $\phi(\zeta, \xi)$ be the joint characteristic function of $(X, Y)$, and
suppose that

$$(1/2\pi) \int_{-t_n}^{t_n} \phi(\zeta, \xi)\, e^{-i\xi y}\, d\xi$$

converges locally uniformly in $y$ to a function $p^\zeta(y)$ (necessarily continuous in $y$)
for every fixed $\zeta$ and any fixed sequence of positive numbers $t_n \to \infty$ as $n \to \infty$.
Then $p^0(y)$ is the density of $Y$, and $p^\zeta(y) = \phi^y(\zeta)\, p^0(y)$ where $\phi^y(\zeta) =
E(e^{i\zeta X} \mid Y = y)$.


PROOF. Consider the complex measure $dF^\zeta(y) = \phi^y(\zeta)\, dF(y)$, where $F$ is the
distribution of $Y$. Note that $F^\zeta$ is of bounded variation, and that the characteristic
function $\gamma^\zeta$ of $F^\zeta$ is given by $\gamma^\zeta(\xi) = \int e^{i\xi y} \phi^y(\zeta)\, dF(y) = \phi(\zeta, \xi)$. Then, as in
Dubuc and Seneta (1976), it is seen that $\int_a^b p^\zeta(y)\, dy = F^\zeta(a, b)$ at continuity
points of $F^\zeta$, the proof applying without change to a complex measure. Thus
$p^\zeta(y)$ is the density of $F^\zeta$. Setting $\zeta = 0$, it follows that $p^0(y)$ is the density of $Y$,
and hence $p^\zeta(y) = \phi^y(\zeta)\, p^0(y)$ as required. □


LEMMA 7.

$$(1/2\pi) \int_{-\pi c_n/L}^{\pi c_n/L} e^{-iw\xi}\, \psi_n(\zeta, \xi)\, d\xi \to \psi^w(\zeta)\, p(w)$$

locally uniformly in $\zeta \in R$ and $w > 0$ where $p(w)$ is the continuous density of $W$,
and $\psi^w(\zeta) = E(e^{i\zeta U} \mid W = w)$.


PROOF. The argument here is identical to that given in Dubuc and Seneta
(1976) on taking $K_n(\xi, w) = e^{-iw\xi}\, \psi_n(\zeta, \xi)$, $q = 0$ and using Lemmas 3, 5, and 6
here in place of their Lemmas 3, 7, and 8, and we omit the details. It is only
necessary to note that $c_n^{-1/2} |\zeta| \le \pi/L$ for all $n$ sufficiently large for the
applicability of Lemmas 3 and 5. □


4. Proof of the conditional limit theorem. If $Z = X + Y$ where $X$, $Y$ are
two lattice random variables such that the distributions of $X$ and $Y \mid X = x$ have
period $L$, then it is easily seen that the distribution of $Z$ must also have period
$L$. It follows that the distributions of both $Z_n$ and $V_n$ have period $L$ for all $n \ge 1$.
The possible values of $V_n$ are easily seen to be among $\sum_{j=0}^{n-1} r^j \pmod{L}$ and hence if
$(v_n)$ is any sequence of positive integers such that $v_n \equiv \sum_{j=0}^{n-1} r^j \pmod{L}$ and
$\lim_{n \to \infty} v_n/c_n = w > 0$ then we have [cf. Bartlett (1938) and Steck (1957)]
$\psi_n^{v_n}(\zeta)\, p_n^0(v_n) = p_n^\zeta(v_n)$ and

$$L^{-1} c_n\, p_n^\zeta(v_n) = (1/2\pi) \int_{-\pi c_n/L}^{\pi c_n/L} e^{-i(v_n/c_n)\xi}\, \psi_n(\zeta, \xi)\, d\xi.$$

But now from Lemma 7, $L^{-1} c_n p_n^\zeta(v_n) \to \psi^w(\zeta)\, p(w)$ locally uniformly in $\zeta \in R$.
We have therefore shown that $\psi_n^{v_n}(\zeta) \to \psi^w(\zeta) = E(e^{i\zeta U} \mid W = w)$ locally
uniformly in $\zeta \in R$.
Finally,

$$E(e^{i\zeta X_n} \mid V_n = v_n) = \psi_n^{v_n}\bigl((c_n/v_n)^{1/2} \zeta\bigr) \to E(e^{i\zeta w^{-1/2} U} \mid W = w) = E(e^{i\zeta Z})$$

and the convergence in (ii) follows for each $\theta > 1$. For (i) and the uniformity in
(ii) we need to check that the convergence in Lemma 7 is uniform in compact
intervals of $(1, \infty)$. The convergence $\psi_n(\zeta, \xi) \to \psi(\zeta, \xi)$ required in Lemmas 2 and
7 is uniform in compacts of $(1, \infty)$ from the uniform convergence in (1). Finally,
the choice of constants $\rho$, $A_R$, and $B_R$ in the bounds (5) and (6) may be made
independently of $\theta$ in compact intervals of $(1, \infty)$ for reasons of continuity.
Taking $\zeta = 0$ we see that

$$\frac{P(V_n = v_n \mid \theta_n)}{P(V_n = v_n \mid \theta)} = \frac{L^{-1} c_n(\theta_n)\, p_n^0(v_n \mid \theta_n)}{L^{-1} c_n(\theta)\, p_n^0(v_n \mid \theta)} \cdot \frac{c_n(\theta)}{c_n(\theta_n)} \to 1$$

provided $c_n(\theta_n)/c_n(\theta) \to 1$ as $n \to \infty$, which will be the case if and only if
$\theta_n - \theta = o(n^{-1})$, and (2) follows. □





5. Concluding remarks. For ease of exposition, only the case $X_0 = 1$ was
treated here. It is a relatively straightforward matter to show that the theorem
remains true in the general case when $X_0 = j > 1$. [In the case of an arbitrary
initial distribution $(a_j, j \ge 1)$, then usually one would want to condition on the
observed value $X_0$, provided of course that $(a_j)$ is independent of $\theta$.]

A more substantial generalization would be to relax the assumption that
$p_0 = 0$. In this case, it will be necessary to argue conditionally on nonextinction of
the process by time $n$. (It is inappropriate in the author's view to condition on
nonextinction of the entire process, as this information is never actually
available.) This would require an extension of the arguments given here to the case
where $p_0 > 0$, and this has not been attempted. Note however that since
$P(X_n > 0 \mid \theta) \to 1 - q(\theta)$ locally uniformly in $\theta$ where $q(\theta)$ is the probability of
ultimate extinction, and $q(\theta)$ is continuous, the event $\{X_n > 0\}$ is asymptotically
ancillary in the sense used here.

A referee has pointed out that there must be a connection between the
asymptotic ancillarity of $V_n$ as defined here and a concept of asymptotic
ancillarity defined in terms of Fisher's information contained in $V_n$. Specifically,
let $I_n(\theta)$ be Fisher's information in the observed process up to time $n$ and
$I_{V_n}(\theta) = E[k_n'(\theta)]^2$ where $k_n(\theta) = \log P(V_n = v_n \mid \theta)$. Then $V_n$ is asymptotically
ancillary in this sense if $I_{V_n}(\theta)/I_n(\theta) \to 0$ as $n \to \infty$. Indeed, in the present
problem this ratio is of order $n^2 \theta^{-n}$. In the case of independence, Amari (1982)
defines higher-order asymptotic ancillarity essentially in terms of the order of the
corresponding ratio of information functions. It can be seen informally that the
two approaches to asymptotic ancillarity are very close, as $k_n(\theta + \delta I_n^{-1/2}) -
k_n(\theta) \simeq \delta I_n^{-1/2} k_n'(\theta)$. Thus the convergence of $I_{V_n}/I_n$ to zero will usually entail
(2) and vice versa. For a formal result it will be necessary to impose further
conditions on the sequence $I_n^{-1/2} k_n'(\theta)$, and the question is not pursued further
here. Nevertheless, the close connection between the two concepts is illuminating.
The rate at which the ratio (2) converges to one is a measure of the degree of
ancillarity; in our case it is $O(n\theta^{-n/2})$, which is the same order as $(I_{V_n}/I_n)^{1/2}$.


Acknowledgment. The author wishes to express his thanks to the referees 
for their constructive comments—in particular those relating to the concept of 
asymptotic ancillarity. 




REFERENCES 


AMARI, S. (1982). Geometrical theory of asymptotic ancillarity and conditional inference.
Biometrika 69 1-17.

BARNDORFF-NIELSEN, O. (1980). Conditionality resolutions. Biometrika 67 293-310.

BARTLETT, M. S. (1938). The characteristic function of a conditional statistic. J. London Math.
Soc. 13 62-67.

BASAWA, I. V. and BROCKWELL, P. J. (1984). Asymptotic conditional inference for regular
nonergodic models with an application to autoregressive processes. Ann. Statist. 12 161-171.

BASAWA, I. V. and SCOTT, D. J. (1976). Efficient tests for branching processes. Biometrika 63
531-536.

BASAWA, I. V. and SCOTT, D. J. (1983). Asymptotic Optimal Inference for Nonergodic Models.
Lecture Notes in Statist. 17. Springer, Berlin.

COX, D. R. (1980). Local ancillarity. Biometrika 67 279-286.

DUBUC, S. and SENETA, E. (1976). The local limit theorem for the Galton-Watson process. Ann.
Probab. 4 490-496.

EFRON, B. and HINKLEY, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator:
Observed versus expected Fisher information (with discussion). Biometrika 65 457-487.

FEIGIN, P. D. and REISER, B. (1979). On asymptotic ancillarity and inference for Yule and regular
nonergodic processes. Biometrika 66 279-283.

HEYDE, C. C. (1975). Remarks on efficiency in estimation for branching processes. Biometrika 62
49-55.

HINKLEY, D. V. (1980). Likelihood as approximate pivotal distribution. Biometrika 67 287-292.

JAGERS, P. (1975). Branching Processes with Biological Applications. Wiley, New York.

KEIDING, N. (1974). Estimation in the birth process. Biometrika 61 71-80.

MCCULLAGH, P. (1984). Local sufficiency. Biometrika 71 233-244.

RYALL, T. A. (1981). Extensions of the concept of local ancillarity. Biometrika 68 677-683.

STECK, G. P. (1957). Limit theorems for conditional distributions. Univ. Calif. Publ. Stat. 42
237-284.

SWEETING, T. J. (1978). On efficient tests for branching processes. Biometrika 65 123-127.

SWEETING, T. J. (1980). Uniform asymptotic normality of the maximum likelihood estimator. Ann.
Statist. 8 1375-1381.

SWEETING, T. J. (1982). Correction note to "Uniform asymptotic normality of the maximum
likelihood estimator." Ann. Statist. 10 320-321.

SWEETING, T. J. (1983). On estimator efficiency in stochastic processes. Stochastic Process. Appl. 15
93-98.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF SURREY 
GUILDFORD SURREY GU2 6XH 
ENGLAND 


The Annals of Statistics 
1986, Vol. 14, No. 3, 934-953 


NONPARAMETRIC BAYESIAN REGRESSION 


By DANIEL BARRY 
University College, Cork 


It is desired to estimate a real valued function F on the unit square
having observed F with error at N points in the square. F is assumed to be
drawn from a particular Gaussian process and measured with independent
Gaussian errors. The proposed estimate is the Bayes estimate of F given the
data. The roughness penalty corresponding to the prior is derived and it is
shown how the Bayesian technique can be regarded as a generalization of
variance components analysis. The proposed estimate is shown to be
consistent in the sense that the expected squared error averaged over the data
points converges to zero as $N \to \infty$. Upper bounds on the order of magnitude
of the expected average squared error are calculated. The proposed technique
is compared with existing spline techniques in a simulation study.
Generalisations to higher dimensions are discussed.


1. Introduction and summary. Let $(y_i, x_i)$, $i = 1, 2, \ldots, N$, satisfy

$$y_i = F(x_i) + \varepsilon_i$$

where $x_i \in [0,1] \times [0,1]$ for each $i$, $F$ is a fixed but unknown regression
function, and the errors $\{\varepsilon_i\}$ are uncorrelated with mean zero and variance $v$.
This paper concerns estimation of $F$. The classic example of this situation is
where $X = (X_1, X_2)$ specifies coordinates on a plane and $Y$ is a measure such as
height above sea level. Our aim is to use the data to construct a map.

Wahba (1978) considers the above problem when $x_i \in [0,1]$, $i = 1, 2, \ldots, N$.
Later in Wahba (1979) she extended the technique to cover estimation in higher
dimensions. Here we consider another possible generalisation to two dimensions
of Wahba's work and claim that it fits more satisfactorily into the Bayesian
formulation described in Wahba (1978).

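Motivated by a decomposition often used in two-way analysis of variance we
can write

$$F(x_1, x_2) = \mu + \alpha(x_1) + \beta(x_2) + \gamma(x_1, x_2),$$

where

$$\mu = \int_0^1 \int_0^1 F(u, v)\, du\, dv,$$

$$\alpha(x_1) = \int_0^1 F(x_1, v)\, dv - \mu,$$

$$\beta(x_2) = \int_0^1 F(u, x_2)\, du - \mu,$$

$$\gamma(x_1, x_2) = F(x_1, x_2) - \alpha(x_1) - \beta(x_2) - \mu.$$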
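
The decomposition has a direct discrete analogue that may help fix ideas. The sketch below is an illustration and not part of the paper: it assumes $F$ has been evaluated on an equally spaced midpoint grid, so that the integrals over $[0,1]$ can be approximated by row, column, and overall means, exactly as in a two-way analysis of variance. The function name `anova_decompose` is hypothetical.

```python
import numpy as np


def anova_decompose(F_grid):
    """Split grid values F[i, j] ~ F(x1_i, x2_j) into mu, alpha, beta, gamma.

    Integrals over [0, 1] are approximated by means over an equally
    spaced grid (an assumption of this sketch).
    """
    F = np.asarray(F_grid, dtype=float)
    mu = F.mean()                      # approximates the double integral of F
    alpha = F.mean(axis=1) - mu        # main effect of x1
    beta = F.mean(axis=0) - mu         # main effect of x2
    gamma = F - alpha[:, None] - beta[None, :] - mu   # interaction term
    return mu, alpha, beta, gamma
```

By construction the main effects average to zero and the interaction term averages to zero along each row and each column, mirroring the side conditions implied by the integral definitions.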


Received January 1984; revised November 1985. 
AMS 1980 subject classifications. Primary 62G05, secondary 62405, 62M99. 
Key words and phrases Bayes estimate, Brownian sheet, roughness penalty, consistency. 




A prior for $F$ is constructed by putting independent priors on $\mu$, $\alpha$, $\beta$, and $\gamma$ as
follows:


(i) $\mu \sim N(0, v_0)$.
(ii) $\alpha(x_1) \sim Z_1(x_1) - \int_0^1 Z_1(u)\, du$ where $Z_1$ is a Brownian motion with
variance $v_1$.
(iii) $\beta(x_2) \sim Z_2(x_2) - \int_0^1 Z_2(u)\, du$ where $Z_2$ is a Brownian motion with
variance $v_2$.
(iv) $\gamma(x_1, x_2) \sim Z(x_1, x_2) - \int_0^1 Z(x_1, u)\, du - \int_0^1 Z(u, x_2)\, du + \int_0^1 \int_0^1 Z(u, v)\, du\, dv$,
where $Z$ is a Brownian sheet with variance $v_{12}$.


To complete the specification of the probability model required for a Bayesian
analysis we assume (v) given $F$, the data $\{y_i\}$ are independent with
$y_i \sim N[F(x_i), v]$, $i = 1, 2, \ldots, N$.


The proposed estimate of $F$ is then the limit of the Bayes estimate of $F$ as
$v_0 \to \infty$, i.e.,

$$\hat F(x_1, x_2) = \lim_{v_0 \to \infty} E_{v_0}(F(x_1, x_2) \mid y_1, y_2, \ldots, y_N),$$

where $E_{v_0}$ is expectation with respect to the posterior density resulting from the
probability model defined in (i) to (v) above.
In Section 2 we indicate how the above prior came about as a generalisation of
the one-dimensional priors described in Wahba (1978). Section 3 demonstrates
that the suggested Bayesian analysis corresponds to choosing $F$ to minimise

$$\sum_{i=1}^{N} (y_i - F(x_i))^2 + P(F),$$

where

$$P(F) = c_1 \int_0^1 \Bigl[\int_0^1 F_1(x_1, v)\, dv\Bigr]^2 dx_1 + c_2 \int_0^1 \Bigl[\int_0^1 F_2(u, x_2)\, du\Bigr]^2 dx_2
+ c_{12} \int_0^1 \int_0^1 F_{12}(u, v)^2\, du\, dv,$$

$F_1$, $F_2$ are partial derivatives, $F_{12}$ is the second-order mixed partial derivative,
and $c_1 = v/v_1$, $c_2 = v/v_2$, $c_{12} = v/v_{12}$. The possibility of regarding the technique
as a generalisation of variance components analysis is examined in Section 4.

Asymptotic properties are considered in Sections 5 and 6. These properties 
depend only on the assumption that the errors are uncorrelated with mean zero 
and constant variance. The symbol E will be used throughout to denote expecta- 
tion with respect to these assumptions. Thus the expectations have a frequentist 
interpretation since they can be interpreted without reference to the prior; from 
a Bayesian viewpoint they are expectations conditional on F. In Section 5 we 
prove: 


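THEOREM. Define

$$R(\hat F, F) = \frac{1}{N} \sum_{i=1}^{N} (\hat F(x_i) - F(x_i))^2.$$

Then, under certain smoothness assumptions on $F$,

$$ER(\hat F, F) \le \frac{1}{N} P(F) + \frac{v}{4N} \Bigl[12 + \frac{N v_1}{v} + \frac{N v_2}{v} + \frac{N v_{12}}{v}\Bigr].$$

Hence $ER(\hat F, F) \to 0$ as $N \to \infty$ provided

$$\frac{v_1}{v} \to 0,\ \frac{N v_1}{v} \to \infty,\quad \frac{v_2}{v} \to 0,\ \frac{N v_2}{v} \to \infty,\quad \frac{v_{12}}{v} \to 0,\ \frac{N v_{12}}{v} \to \infty.$$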

In Section 6 we consider data on a grid and prove: 


THEOREM. Suppose we have N = mn observations on a grid 
[lx xa) l sism,1<j<n}. 
Then, under certain smoothness assumptions on F, 
B (v, + vig) 1 


ER(F, F) < E(P) + o| r 


B,(v. + 1/2 
ro|( Bete) | + O[(B,Byp,2)'I0g(B,B,>.)), 
where B, = max(x,, ~ x,,_,), B = max(X»,— X2)~1)- 


This theorem provides more precise rates of convergence in the grid case. The
choice of values for $c_1$, $c_2$, and $c_{12}$ in a particular application is difficult, a
difficulty typical of all nonparametric estimation techniques. In the simulation study
described in this paper Wahba's method of generalised cross validation [see
Craven and Wahba (1979)] was used to choose values for $c_1$, $c_2$, and $c_{12}$. It
remains to be shown that the values so obtained have the orders of magnitude
suggested by the asymptotic calculations.

Wahba’s two-dimensional regression technique is described in Section 7 and is 
compared with the Bayesian technique in a simulation study reported in Section 
8. Finally, generalisations of the Bayesian technique to higher dimensions are 
described in Section 9. 


2. The Bayes estimate. Wahba (1978) considers the problem of estimating
a regression function $F: [0,1] \to R$ given data $(x_i, y_i)$, $i = 1, 2, \ldots, N$, with

$$y_i = F(x_i) + \varepsilon_i,$$

where $\{\varepsilon_i\}$ are iid $N(0, v)$. A prior for $F$ is constructed by writing

$$F(x) = \mu + \alpha(x),$$

where $\mu = F(0)$, $\alpha(x) = F(x) - F(0)$ and then putting independent priors on $\mu$
and $\alpha$ as follows:


(i) $\mu \sim N(0, v_0)$.
(ii) $\alpha(x) \sim Z(x)$ where $Z$ is a Brownian motion with variance $v_1$, i.e., $Z$ is a
mean-zero Gaussian process with

$$\mathrm{Cov}[Z(x), Z(y)] = v_1 \min(x, y).$$

The estimator $\hat F$ is defined by

$$\hat F(x) = \lim_{v_0 \to \infty} E_{v_0}(F(x) \mid y_1, y_2, \ldots, y_N),$$

where $E_{v_0}$ is expectation with respect to the posterior density generated by the
prior and the normality assumptions on the errors $\{\varepsilon_i\}$. Wahba shows that this
procedure is equivalent to choosing $F$ absolutely continuous to minimise

$$\sum_{i=1}^{N} (y_i - F(x_i))^2 + c \int_0^1 F'(x)^2\, dx,$$

where $c = v/v_1$. She further describes the equivalence between roughness penalties
of the form $\int [F^{(m)}(x)]^2\, dx$ and corresponding Bayesian priors which for
$m > 1$ are integrated Wiener processes [see Shepp (1966)]. Here we generalise the
$m = 1$ prior to the situation where we have data in two dimensions and follow
Wahba in defining the estimator.

One possible generalisation of Brownian motion to two dimensions is called a
Brownian sheet. This is a Gaussian process $Z$ indexed by $\{(x_1, x_2): x_1 \ge 0,
x_2 \ge 0\}$ and satisfying

$$EZ(x_1, x_2) = 0,$$

$$\mathrm{Cov}[Z(x_1, x_2), Z(y_1, y_2)] = \min(x_1, y_1) \min(x_2, y_2).$$

See Zimmerman (1972).
The one-dimensional prior was based on the decomposition

$$F(x) = F(0) + (F(x) - F(0))$$

with a Brownian motion prior on the term $F(x) - F(0)$. Consider a similar
decomposition here,

$$F(x_1, x_2) = F(0,0) + (F(x_1, x_2) - F(0,0)),$$

with a Brownian sheet prior on $F(x_1, x_2) - F(0,0)$. This is not satisfactory since
with probability one a Brownian sheet is constant along the lines $x_1 = 0$ and
$x_2 = 0$. We do not want to impose such conditions on our estimator.
Alternatively we might consider the following four-way decomposition:

$$F(x_1, x_2) = F(0,0) + [F(x_1, 0) - F(0,0)] + [F(0, x_2) - F(0,0)]
+ [F(x_1, x_2) + F(0,0) - F(x_1, 0) - F(0, x_2)].$$

We could then put independent priors on each part: $F(0,0) \sim N(0, v_0)$, Brownian
motions on $[F(x_1, 0) - F(0,0)]$ and $[F(0, x_2) - F(0,0)]$ and a Brownian sheet on
$[F(x_1, x_2) + F(0,0) - F(x_1, 0) - F(0, x_2)]$. The problem with this approach
however is that it imposes different smoothness conditions on the function values
along the axes through zero.


This can be seen by applying the argument to be given in Section 3 to this
Bayesian model to show that the Bayes estimate in this case corresponds to the
use of a roughness penalty of the form

$$\beta_1 \int_0^1 F_1(x, 0)^2\, dx + \beta_2 \int_0^1 F_2(0, y)^2\, dy + \beta_{12} \int_0^1 \int_0^1 F_{12}(x, y)^2\, dy\, dx,$$

where $F_1$, $F_2$ are partial derivatives, $F_{12}$ is the second-order mixed partial
derivative, and $\beta_1$, $\beta_2$, $\beta_{12}$ are constants depending on the error variance $v$ and
the variances associated with the elements of the prior specification. There is
rarely a prior justification for this.

To avoid this difficulty we propose the following decomposition:

$$F(x_1, x_2) = \mu + \alpha(x_1) + \beta(x_2) + \gamma(x_1, x_2),$$

where (all integrals have range [0,1])

$$\mu = \int\!\!\int F(u, v)\, du\, dv,$$

$$\alpha(x_1) = \int F(x_1, v)\, dv - \int\!\!\int F(u, v)\, du\, dv,$$

$$\beta(x_2) = \int F(u, x_2)\, du - \int\!\!\int F(u, v)\, du\, dv,$$

$$\gamma(x_1, x_2) = F(x_1, x_2) - \int F(x_1, v)\, dv - \int F(u, x_2)\, du + \int\!\!\int F(u, v)\, du\, dv.$$

We proceed to put independent priors on $\mu$, $\alpha$, $\beta$, and $\gamma$ as follows:

1. $\mu \sim N(0, v_0)$.
2. $\alpha(x_1) \sim Z_1(x_1) - \int Z_1(u)\, du$, where $Z_1(u)$ is Brownian motion with variance
$v_1$.
3. $\beta(x_2) \sim Z_2(x_2) - \int Z_2(v)\, dv$, where $Z_2(v)$ is Brownian motion with variance
$v_2$.
4. $\gamma(x_1, x_2) \sim Z(x_1, x_2) - \int Z(u, x_2)\, du - \int Z(x_1, v)\, dv + \int\!\!\int Z(u, v)\, du\, dv$, where
$Z(u, v)$ is a Brownian sheet with variance $v_{12}$.

The following results may be easily shown:

1. $\mathrm{Cov}(\alpha(x), \alpha(y)) = v_1[\min(x, y) - x - y + (x^2 + y^2)/2 + \tfrac{1}{3}] = v_1 h(x, y)$, say.
2. $\mathrm{Cov}(\beta(x), \beta(y)) = v_2 h(x, y)$.
3. $\mathrm{Cov}(\gamma(x_1, x_2), \gamma(y_1, y_2)) = v_{12} h(x_1, y_1) h(x_2, y_2)$.

In summary, therefore, the prior specification is the same as the distribution of
the process

$$\mu + R(x_1, x_2),$$

where $\mu \sim N(0, v_0)$, and $R(u, v)$ is a mean zero Gaussian process with

$$\mathrm{Cov}(R(x_1, x_2), R(y_1, y_2)) = v_1 h(x_1, y_1) + v_2 h(x_2, y_2) + v_{12} h(x_1, y_1) h(x_2, y_2) = Q(x, y), \text{ say.}$$
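
The covariance functions above are straightforward to evaluate. The following sketch, which is an illustration rather than code from the paper, implements $h(x, y) = \min(x, y) - x - y + (x^2 + y^2)/2 + 1/3$ and $Q(x, y) = v_1 h(x_1, y_1) + v_2 h(x_2, y_2) + v_{12} h(x_1, y_1) h(x_2, y_2)$ with NumPy; the function names and the array layout (points stored as rows of length two) are assumptions.

```python
import numpy as np


def h(x, y):
    """Covariance kernel of Z(x) - int_0^1 Z(u) du for unit-variance Brownian motion."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.minimum(x, y) - x - y + 0.5 * (x ** 2 + y ** 2) + 1.0 / 3.0


def Q(x, y, v1, v2, v12):
    """Prior covariance of R at points x, y (last axis holds the two coordinates)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    h1 = h(x[..., 0], y[..., 0])
    h2 = h(x[..., 1], y[..., 1])
    return v1 * h1 + v2 * h2 + v12 * h1 * h2


def gram_matrix(points, v1, v2, v12):
    """N x N matrix with entries Q(x_i, x_j) for an (N, 2) array of points."""
    pts = np.asarray(points, dtype=float)
    return Q(pts[:, None, :], pts[None, :, :], v1, v2, v12)
```

Two properties are easy to check numerically: $h(x, \cdot)$ integrates to zero over $[0,1]$ for every $x$, reflecting the centring of the Brownian motion, and the matrix of values $Q(x_i, x_j)$ is symmetric and nonnegative definite, as any covariance matrix must be.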
We now assume that we have observations $(x_i, y_i)$, $i = 1, 2, \ldots, N$, with

$$x_i = (x_{1i}, x_{2i}) \in [0,1] \times [0,1]$$

and

$$y_i = F(x_i) + \varepsilon_i,$$

with $\{\varepsilon_i\}$ iid $N(0, v)$. The proposed estimator is defined by

$$\hat F(x_1, x_2) = \lim_{v_0 \to \infty} E_{v_0}\{F(x_1, x_2) \mid y_1, y_2, \ldots, y_N\},$$

where $E_{v_0}$ is expectation with respect to the posterior density generated by the
above prior and the normality assumptions on the error terms.


3. The roughness penalty. For a function $F: [0,1] \times [0,1] \to R$, let $F_1$
denote the partial derivative with respect to $x_1$, $F_2$ the partial derivative with
respect to $x_2$, and $F_{12}$ the second-order mixed partial derivative.

Let H consist of all functions F: [0,1] x [0,1] > R satisfying the following 
conditions: 

(i) F is absolutely continuous. 

(ii) F, is an absolutely continuous function of x, for each x, in [0,1]. 

(iii) F, is an absolutely continuous function of x, for each x, in [0,1]. 

(iv) $F_{12} \in L^2[0,1]^2$, i.e., $\int_0^1 \int_0^1 F_{12}^2(u, v)\, du\, dv < \infty$.


Then we have the following theorem. 
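THEOREM 3.1. The Bayes estimate generated by the prior of Section 2 is the
unique element in $H$ minimising

$$(3.1) \qquad \sum_{i=1}^{N} (y_i - F(x_i))^2 + P(F),$$

where

$$P(F) = c_1 \int_0^1 \Bigl[\int_0^1 F_1(x_1, v)\, dv\Bigr]^2 dx_1 + c_2 \int_0^1 \Bigl[\int_0^1 F_2(u, x_2)\, du\Bigr]^2 dx_2
+ c_{12} \int_0^1 \int_0^1 F_{12}(u, v)^2\, du\, dv$$

with $c_1 = v/v_1$, $c_2 = v/v_2$, and $c_{12} = v/v_{12}$.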
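PROOF. For $F, G \in H$ define

$$\langle F, G \rangle = \Bigl[\int\!\!\int F(u, v)\, du\, dv\Bigr] \Bigl[\int\!\!\int G(u, v)\, du\, dv\Bigr]
+ a_1 \int \Bigl[\int F_1(x_1, u)\, du\Bigr] \Bigl[\int G_1(x_1, v)\, dv\Bigr] dx_1$$

$$\qquad + a_2 \int \Bigl[\int F_2(u, x_2)\, du\Bigr] \Bigl[\int G_2(v, x_2)\, dv\Bigr] dx_2
+ a_{12} \int\!\!\int F_{12}(u, v)\, G_{12}(u, v)\, du\, dv,$$

where $a_1 = 1/v_1$, $a_2 = 1/v_2$, and $a_{12} = 1/v_{12}$.

It can be shown that $\langle \cdot, \cdot \rangle$ is an inner product for $H$. The only difficulty is in
proving that $\langle F, F \rangle = 0 \Rightarrow F = 0$. Now

$$\langle F, F \rangle = 0 \Rightarrow \int\!\!\int F_{12}^2 = 0 \Rightarrow F_{12} = 0 \text{ a.e.}$$

$\Rightarrow F_1$ is a function of $x_1$ alone and $F_2$ is a function of $x_2$ alone. Next,

$$\langle F, F \rangle = 0 \Rightarrow \int \Bigl[\int F_1(x_1, v)\, dv\Bigr]^2 dx_1 = 0 \Rightarrow F_1 = 0 \text{ a.e.}$$

Similarly $\langle F, F \rangle = 0 \Rightarrow F_2 = 0$ a.e. and hence $\langle F, F \rangle = 0 \Rightarrow F$ is constant. But

$$\langle F, F \rangle = 0 \Rightarrow \int\!\!\int F = 0 \Rightarrow F = 0 \text{ a.e.}$$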
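Since $F$ is absolutely continuous we have $F \equiv 0$. For $x, y \in [0,1] \times [0,1]$ define

$$\phi_0(x) = 1,$$

$$Q(x, y) = v_1 h(x_1, y_1) + v_2 h(x_2, y_2) + v_{12} h(x_1, y_1) h(x_2, y_2),$$

where

$$h(x, y) = \min(x, y) - (x + y) + \tfrac{1}{2}(x^2 + y^2) + \tfrac{1}{3}.$$

Then

(1) $\phi_0 \in H$; $Q(x, \cdot) \in H$ for all $x$,

(2) $\langle \phi_0, Q(x, \cdot) \rangle = 0$ for all $x$,

(3) $\langle F, \phi_0 + Q(x, \cdot) \rangle = F(x)$ for all $F \in H$.


(3) follows easily upon noting that

$$\int_0^1 h(u, v)\, dv = 0 \quad \text{for all } u$$

and that for any absolutely continuous function $\psi$

$$\int_0^1 \psi'(y)\, \frac{\partial}{\partial y} h(x, y)\, dy = \psi(x) - \int_0^1 \psi(y)\, dy.$$

It follows from (1)-(3) that

$$\langle Q(y, \cdot), Q(x, \cdot) \rangle = Q(y, x).$$

For any function $F \in H$ we can write

$$F(x) = a \phi_0(x) + \sum_{i=1}^{N} b_i Q(x_i, x) + e(x),$$

where

$$\langle \phi_0, e \rangle = 0,$$

$$\langle Q(x_i, \cdot), e \rangle = 0, \quad i = 1, 2, \ldots, N.$$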

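Hence

$$F(x_i) = \langle F, \phi_0 + Q(x_i, \cdot) \rangle = a + \sum_{j=1}^{N} b_j Q(x_j, x_i).$$

Also

$$P(F) = v \langle F, F \rangle - v \Bigl[\int_0^1 \int_0^1 F(u, v)\, du\, dv\Bigr]^2
= v \sum_i \sum_j b_i b_j Q(x_i, x_j) + v \langle e, e \rangle.$$

Thus to minimise (3.1) we choose $e = 0$ and $a$, $b = (b_1, b_2, \ldots, b_N)'$ to minimise

$$(y - a\mathbf{1} - Qb)'(y - a\mathbf{1} - Qb) + v\, b'Qb,$$

where $\mathbf{1}$ is a vector of ones and

$$Q = (Q(x_i, x_j)), \quad i, j = 1, 2, \ldots, N.$$

As in Lemma 5.1 of Kimeldorf and Wahba (1971) the minimising values are given
by

$$a = (\mathbf{1}'M^{-1}\mathbf{1})^{-1} \mathbf{1}'M^{-1} y,$$

$$b = M^{-1}[I - \mathbf{1}(\mathbf{1}'M^{-1}\mathbf{1})^{-1} \mathbf{1}'M^{-1}] y,$$

where $M = Q + vI$. That this formula also gives the Bayes estimate follows as in
Theorem 1 of Wahba (1978). □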
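
The closed-form minimiser lends itself to a short numerical sketch. The code below is an illustration under stated assumptions and is not the author's implementation: it builds $M = Q + vI$ from the kernel $h(x, y) = \min(x, y) - (x + y) + \tfrac{1}{2}(x^2 + y^2) + \tfrac{1}{3}$, solves for $a$ and $b$ with linear solves rather than an explicit inverse, and returns a predictor $x \mapsto a + \sum_i b_i Q(x_i, x)$. The function names and argument order are hypothetical.

```python
import numpy as np


def _h(x, y):
    # Kernel h(x, y) = min(x, y) - (x + y) + (x^2 + y^2)/2 + 1/3.
    return np.minimum(x, y) - x - y + 0.5 * (x ** 2 + y ** 2) + 1.0 / 3.0


def _gram(A, B, v1, v2, v12):
    # Matrix of Q(a_i, b_j) for point arrays A (nA x 2) and B (nB x 2).
    h1 = _h(A[:, None, 0], B[None, :, 0])
    h2 = _h(A[:, None, 1], B[None, :, 1])
    return v1 * h1 + v2 * h2 + v12 * h1 * h2


def fit_bayes_estimate(points, y, v, v1, v2, v12):
    """Return (a, b, predict) for the penalised least-squares problem (3.1)."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(y)
    M = _gram(X, X, v1, v2, v12) + v * np.eye(N)   # M = Q + vI
    ones = np.ones(N)
    Minv_y = np.linalg.solve(M, y)
    Minv_1 = np.linalg.solve(M, ones)
    a = (ones @ Minv_y) / (ones @ Minv_1)          # (1'M^-1 1)^-1 1'M^-1 y
    b = Minv_y - a * Minv_1                        # M^-1 (y - a 1)

    def predict(new_points):
        Z = np.asarray(new_points, dtype=float)
        return a + _gram(Z, X, v1, v2, v12) @ b

    return a, b, predict
```

Note that $b = M^{-1}(y - a\mathbf{1})$ is algebraically the same as the bracketed expression in the text, and that the coefficients satisfy $\mathbf{1}'b = 0$. When the data are constant the fit reproduces that constant exactly, since the penalty $P(F)$ vanishes on constants.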


Notes. 1. The form of the roughness penalty indicates how scale changes in 
the X-variables can be taken into account by adjusting the smoothing parame- 
ters, e.g., if we rescale X, to W, = a + Bx, then adjusting c, to c,/8 and cy, to 
Ci2/B leaves the roughness penalty unchanged. 

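2. Let $\hat{\mathbf{F}} = (\hat F(x_1), \hat F(x_2), \ldots, \hat F(x_N))'$. Then there exists a matrix $A$ with

$$\hat{\mathbf{F}} = Ay$$

and $A$ is given by

$$A = \mathbf{1}(\mathbf{1}'M^{-1}\mathbf{1})^{-1}\mathbf{1}'M^{-1} + QM^{-1}[I - \mathbf{1}(\mathbf{1}'M^{-1}\mathbf{1})^{-1}\mathbf{1}'M^{-1}]
= QM^{-1} + vM^{-1}\mathbf{1}(\mathbf{1}'M^{-1}\mathbf{1})^{-1}\mathbf{1}'M^{-1},$$

where

$\mathbf{1} = (1, 1, \ldots, 1)'$, $N \times 1$,
$M = Q + vI$, $N \times N$,
$Q = (Q(x_i, x_j))$, $N \times N$.


Evaluation of $\hat F$ therefore involves inverting the $N \times N$ matrix $M$.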
3. The completeness of H is left open; the method of proof used in Wahba 
(1978) does not depend on it. 




4. Data on a grid in two dimensions. In this section we derive a
representation for the estimator $\hat F$ when observations are taken on a grid in two
dimensions.

THEOREM 4.1. Suppose we have N = mn observations on a grid 

(ze): stsmlsj<n}, 
where 
Osx, <X.< t <x, 81, 
0 S Xo < Xa < ++: <2Xq, S81. 
Let F,, = F(x,,X2,) and define 
F = (Fy, Fase Baas Figs Foose., Bangs +++ Fins Ponse Fan) 


and define y similarly. Let 


ti 


a = (A(x, + X12); AEST = Xub- wt (Xt t tind 
b; = (t(x + ta), $223 ~ Ker), -e31 — HX an + Xana)’ 


For 1 < i < m define 
n 
a(F, i) = X a, Fy 
gel 
For 1 <j < n define 


AUF, j) = E bF we 


t=] 
Then the Bayes estimate Ê is chosen to minimise 


y-FYy-F+a,¥ bi ’? a ali 
wm li lel 


[A(F, j) - B(F, 7-1)]? 








th Y 


J]=2 Xa, — Xa)-1 
(F, +F, -ly-t Ps 1 =f, = y 


+da È È 


1=2 J=2 (xu — Xir-1)(£2; — %,-1) 
where 
d, = v/(v, + o,2b4Q.b,), 
d, = v/(v + 0,2844,8,), 


dip = 0/02, 




where 
Q: = (Gay); i, J = Ly 2ss m, 
Q: = (A(x2,, %2,))s i, J = 1,2,.. ‘n 


h(x, y) = min(x, y) — (x + y) + 4(x? + y?) +4. 


Proor. For j = 2,3,...,m let a, be a m X 1 vector of zeroes with 1 in the 
jth place and —1 in the (jJ — 1)st place. For j = 2,3,...,n define n X 1 vectors 


b, similarly. Define 
V,, = (a,x b )F. 
For the prior as specified in Section 2 it can be checked that 


(i) Cov[V,;, Va] = Oif i # ror j # 8; 
(ii) Var[ Vi] > vo; 
(iii) Var[ Va] = (41, — 4-1)(01 + 212b1Q2b;), i 2 2; 
(iv) VarlV,;] = (x2, ~ X2,-1)(02 + 012849181), J = 2; 
(v) Var[ V, ']= (£u Xy,-1)(%2, 7 Xj- Din i, J = 2. 
Using these facts we may rewrite the log posterior density and the result 


follows upon noting that 
Va = o(F, 2) -a(F,i- 1), 
Vi, = B(F, J) - B(F, j- 1), 
V2 8, + Fi- f, 


t-ly FF joa oO 
The following corollary is easily shown. 
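COROLLARY 4.2. For the special case where

$$x_{1i} = (2i - 1)/2m, \qquad x_{2j} = (2j - 1)/2n,$$

the estimate $\hat F$ is chosen to minimise

$$(y - F)'(y - F) + d_1 \sum_{i=2}^{m} \frac{(\bar F_{i\cdot} - \bar F_{i-1,\cdot})^2}{x_{1i} - x_{1,i-1}} + d_2 \sum_{j=2}^{n} \frac{(\bar F_{\cdot j} - \bar F_{\cdot,j-1})^2}{x_{2j} - x_{2,j-1}}$$
$$\qquad + d_{12} \sum_{i=2}^{m} \sum_{j=2}^{n} \frac{(F_{ij} - F_{i-1,j} - F_{i,j-1} + F_{i-1,j-1})^2}{(x_{1i} - x_{1,i-1})(x_{2j} - x_{2,j-1})},$$

where $\bar F_{i\cdot}$ and $\bar F_{\cdot j}$ denote the row and column means of the $F_{ij}$. (For this
design every component of $a_1$ equals $1/m$ and every component of $b_1$ equals $1/n$.)
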
This demonstrates how the two-dimensional technique may be regarded as a
form of analysis of variance incorporating smoothness assumptions on the
underlying regression process or alternatively as generalised variance components
analysis.
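
The weight vector $a_1$ and the kernel $h$ of Theorem 4.1 are easy to evaluate numerically. The following Python sketch is an illustration only. It assumes the readings $a_1 = (\tfrac12(x_{11}+x_{12}), \tfrac12(x_{13}-x_{11}), \ldots, 1-\tfrac12(x_{1,m-1}+x_{1m}))'$ and $h(x,y) = \min(x,y) - (x+y) + \tfrac12(x^2+y^2) + \tfrac13$, and the function names are hypothetical. On the equally spaced design of Corollary 4.2 every weight comes out as $1/m$, which is why the weighted sums reduce to row and column means.

```python
def quadrature_weights(x):
    """Weights a_1 (assumed reading of Theorem 4.1) for sorted points x in [0, 1]."""
    m = len(x)
    w = [0.5 * (x[0] + x[1])]
    w += [0.5 * (x[i + 1] - x[i - 1]) for i in range(1, m - 1)]
    w.append(1.0 - 0.5 * (x[m - 2] + x[m - 1]))
    return w


def h(x, y):
    """Kernel h(x, y) (assumed reading) used to build Q_1 and Q_2."""
    return min(x, y) - (x + y) + 0.5 * (x * x + y * y) + 1.0 / 3.0


def quadratic_form(w, x):
    """Compute w' Q w with Q = (h(x_i, x_j))."""
    return sum(wi * wj * h(xi, xj) for wi, xi in zip(w, x) for wj, xj in zip(w, x))


m = 10
x1 = [(2 * i - 1) / (2 * m) for i in range(1, m + 1)]  # design of Corollary 4.2
a1 = quadrature_weights(x1)
q = quadratic_form(a1, x1)  # the quantity a_1' Q_1 a_1 entering d_2
```

The value `q` is the quantity $a_1' Q_1 a_1$ that appears in $d_2$; Section 6 uses the fact that it is at most one.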


5. Asymptotics: the general case. Define

$$R(\hat F, F) = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat F(x_i) - F(x_i)\bigr)^2.$$

Taking expectation with respect to the error distribution gives

$$ER(\hat F, F) = \frac{1}{N} F'(I - A)^2 F + \frac{v}{N}\,\mathrm{trace}(A^2),$$

where $F = (F(x_1), F(x_2), \ldots, F(x_N))'$ and $A$ is the $N \times N$ matrix such that
$\hat F = Ay$.


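THEOREM 5.1. For $F \in H$ we have that

$$ER(\hat F, F) \le \frac{c_1}{N} \int_0^1 \Bigl[\int_0^1 F_1(x_1, v)\,dv\Bigr]^2 dx_1 + \frac{c_2}{N} \int_0^1 \Bigl[\int_0^1 F_2(u, x_2)\,du\Bigr]^2 dx_2$$
$$\qquad + \frac{c_{12}}{N} \int_0^1\!\!\int_0^1 F_{12}(u, v)^2\,du\,dv + \frac{v}{4}\Bigl(\frac{1}{c_1} + \frac{1}{c_2} + \frac{1}{c_{12}}\Bigr) + \frac{3v}{N}.$$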


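PROOF. By Lemma 4.1 of Craven and Wahba (1979)

$$F'(I - A)^2 F \le c_1 \int_0^1 \Bigl[\int_0^1 F_1(x_1, v)\,dv\Bigr]^2 dx_1 + c_2 \int_0^1 \Bigl[\int_0^1 F_2(u, x_2)\,du\Bigr]^2 dx_2 + c_{12} \int_0^1\!\!\int_0^1 F_{12}(u, v)^2\,du\,dv.$$

Hence we need only consider bounding $\mathrm{trace}(A^2)$. From the remarks after
Theorem 3.1 we have that

$$A = A_1 + E,$$

where

$$A_1 = QM^{-1}, \qquad E = v M^{-1} 1 1' M^{-1} / (1' M^{-1} 1)$$

with

$$Q = (Q(x_i, x_j)), \qquad M = Q + vI,$$

for

$$Q(x, y) = v_1 h(x_1, y_1) + v_2 h(x_2, y_2) + v_{12} h(x_1, y_1) h(x_2, y_2),$$

where

$$h(x, y) = \min(x, y) - (x + y) + \tfrac12(x^2 + y^2) + \tfrac13.$$

For matrices $A$ and $B$ we write $A \le B$ to mean that the eigenvalues of $A$ are
smaller than the corresponding eigenvalues of $B$. Clearly $0 \le A_1 \le I$. We have
that

$$I - A = v M^{-1} - E = v M^{-1/2} \{ I - M^{-1/2} 1 1' M^{-1/2} / (1' M^{-1} 1) \} M^{-1/2}.$$

Since $M^{-1/2} 1 1' M^{-1/2} \ge 0$ and $\mathrm{trace}[M^{-1/2} 1 1' M^{-1/2} / (1' M^{-1} 1)] = 1$ we have

$$0 \le M^{-1/2} 1 1' M^{-1/2} / (1' M^{-1} 1) \le I \ \Rightarrow\ I - A \ge 0 \ \Rightarrow\ A \le I \ \Rightarrow\ E \le I.$$

Hence

$$\mathrm{trace}(A^2) = \mathrm{tr}(A_1^2) + 2\,\mathrm{tr}(A_1 E) + \mathrm{tr}(E^2) \le \mathrm{tr}(A_1^2) + 3$$

since $\mathrm{rank}(A_1 E) \le \mathrm{rank}(E) = 1$. Also

$$\mathrm{trace}(A_1^2) = \mathrm{tr}[QM^{-1}]^2 \le \frac{\mathrm{tr}[Q]}{4v} \quad \text{as } \Bigl(\frac{x}{x + v}\Bigr)^2 \le \frac{x}{4v},$$
$$\le \frac{1}{4v}[N v_1 + N v_2 + N v_{12}]$$

as $Q(x, y) \le v_1 + v_2 + v_{12}$. Hence the result. $\Box$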


Hence $ER(\hat F, F) \to 0$ provided

(i) $c_1 \to \infty$, $c_2 \to \infty$, $c_{12} \to \infty$.
(ii) $c_1/N \to 0$, $c_2/N \to 0$, $c_{12}/N \to 0$.

Expressed in terms of the variances of the prior these requirements become

(i)' $v_1 \to 0$, $v_2 \to 0$, $v_{12} \to 0$.
(ii)' $N v_1 \to \infty$, $N v_2 \to \infty$, $N v_{12} \to \infty$.

Hence as $N$ increases the prior must be tighter [due to (i)'], but not too tight
[due to (ii)'].


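6. Asymptotics: the grid case. In this section we bound $ER(\hat F, F)$ for
data on a grid in two dimensions.

THEOREM 6.1. Suppose we have $N = mn$ observations on a grid

$$\{(x_{1i}, x_{2j}) : 1 \le i \le m,\ 1 \le j \le n\}.$$

Then, for $F \in H$,

$$ER(\hat F, F) \le \frac{1}{N} P(F) + O\Bigl[\Bigl(\frac{B_1 (v_1 + v_{12})}{n}\Bigr)^{1/2}\Bigr] + O\Bigl[\Bigl(\frac{B_2 (v_2 + v_{12})}{m}\Bigr)^{1/2}\Bigr] + O\bigl[(B_1 B_2 v_{12})^{1/2} \log(B_1 B_2 v_{12})\bigr],$$

where

$$B_1 = \max_i (x_{1i} - x_{1,i-1}), \qquad B_2 = \max_j (x_{2j} - x_{2,j-1})$$

and $P(F)$ is as in Section 3.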


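PROOF. From Theorem 4.1 we have that

$$\hat F = Ay,$$

where

$$A = [I + d_1 (G_1 \times b_1 b_1') + d_2 (a_1 a_1' \times G_2) + d_{12} (G_1 \times G_2)]^{-1},$$

where $G_1 = (g_{ij})$ is an $m \times m$ symmetric tri-diagonal matrix with

$$g_{ii} = (x_{1i} - x_{1,i-1})^{-1} + (x_{1,i+1} - x_{1i})^{-1}, \quad 2 \le i \le m - 1,$$
$$g_{11} = (x_{12} - x_{11})^{-1},$$
$$g_{mm} = (x_{1m} - x_{1,m-1})^{-1},$$
$$g_{i,i-1} = -(x_{1i} - x_{1,i-1})^{-1}, \quad 2 \le i \le m,$$

$G_2$ is a similarly defined $n \times n$ symmetric tri-diagonal matrix, $a_1, b_1, d_1, d_2, d_{12}$
are as in Theorem 4.1 and $A \times B$ denotes the Kronecker product of $A$ and $B$
[see Bellman (1970)]. As in Section 5

$$ER(\hat F, F) = \frac{1}{N} F'(I - A)^2 F + \frac{v}{N}\,\mathrm{tr}(A^2)$$

and

$$F'(I - A)^2 F \le P(F).$$

We proceed to bound $\mathrm{tr}(A^2)$.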
The following lemma will be proved later. 


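LEMMA 6.2.

$$A \le [I + \delta_1 (H_1 \times J_2) + \delta_2 (J_1 \times H_2) + \delta_{12} (H_1 \times H_2)]^{-1},$$

where

$$\delta_1 = d_1 / n^2 B_1, \qquad \delta_2 = d_2 / m^2 B_2, \qquad \delta_{12} = d_{12} / B_1 B_2.$$

$J_1$ ($J_2$) is an $m \times m$ ($n \times n$) matrix of ones. $H_1$ ($H_2$) is an $m \times m$ ($n \times n$) matrix
with

$$h_{ij} = \begin{cases} 2, & i = j,\ 1 < i < m, \\ 1, & i = j,\ i = 1, m, \\ -1, & i = j + 1 \text{ or } j = i + 1, \\ 0, & \text{otherwise.} \end{cases}$$
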
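Noting that

(i) $H_1 J_1 = J_1 H_1$, $H_2 J_2 = J_2 H_2$,
(ii) the first eigenvalue of $J_1$ ($J_2$) is $m$ ($n$) and the rest zeroes,
(iii) the eigenvalues of $H_1$ are

$$2\Bigl(1 - \cos\Bigl(\frac{\pi r}{m}\Bigr)\Bigr), \quad r = 0, 1, \ldots, m - 1,$$

and similarly for $H_2$, we have that

$$\mathrm{tr}(A^2) \le 1 + S_1 + S_2 + S_{12},$$

where

$$S_1 = \sum_{r=1}^{m-1} \frac{1}{\bigl[1 + 2n\delta_1 \bigl(1 - \cos(\pi r / m)\bigr)\bigr]^2},$$
$$S_2 = \sum_{r=1}^{n-1} \frac{1}{\bigl[1 + 2m\delta_2 \bigl(1 - \cos(\pi r / n)\bigr)\bigr]^2},$$

and

$$S_{12} = \sum_{r=1}^{m-1} \sum_{s=1}^{n-1} \frac{1}{\bigl[1 + 4\delta_{12} \bigl(1 - \cos(\pi r / m)\bigr)\bigl(1 - \cos(\pi s / n)\bigr)\bigr]^2}.$$
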
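The terms $S_1$, $S_2$, and $S_{12}$ can be bounded as follows:
Choose $M > 0$ such that

$$M x^2 \le 2(1 - \cos(x)) \quad \text{for } x \text{ in } [0, \pi].$$

Then

$$S_1 \le \sum_{r=1}^{m-1} \frac{1}{\bigl[1 + n\delta_1 M (\pi r / m)^2\bigr]^2} \le \frac{m}{\pi} \int_0^{\pi} \frac{dx}{[1 + n\delta_1 M x^2]^2} = O\Bigl[\frac{m}{(n\delta_1)^{1/2}}\Bigr] = O\Bigl[mn \Bigl(\frac{B_1}{n d_1}\Bigr)^{1/2}\Bigr].$$

Similarly,

$$S_2 = O\Bigl[\frac{n}{(m\delta_2)^{1/2}}\Bigr] = O\Bigl[mn \Bigl(\frac{B_2}{m d_2}\Bigr)^{1/2}\Bigr],$$

$$S_{12} \le \sum_{r=1}^{m-1} \sum_{s=1}^{n-1} \frac{1}{\bigl[1 + \delta_{12} M^2 (\pi r / m)^2 (\pi s / n)^2\bigr]^2} \le \frac{mn}{\pi^2} \int_0^{\pi}\!\!\int_0^{\pi} \frac{dx\,dy}{(1 + \delta_{12} M^2 x^2 y^2)^2}$$
$$= O\Bigl[mn \frac{\log(\delta_{12})}{(\delta_{12})^{1/2}}\Bigr] = O\Bigl[mn \Bigl(\frac{B_1 B_2}{d_{12}}\Bigr)^{1/2} \log\Bigl(\frac{B_1 B_2}{d_{12}}\Bigr)\Bigr].$$
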
Since all the elements of $a_1$ are positive and the largest element of $Q_1$ is less than
or equal to one we have

$$a_1' Q_1 a_1 \le a_1' 1 1' a_1 = 1.$$

Hence $1/d_2 = O[v_2 + v_{12}]$. Similarly, $1/d_1 = O[v_1 + v_{12}]$. Hence the theorem
follows. $\Box$

PROOF OF LEMMA 6.2. (i) We show

$$a_1 a_1' \ge J_1 / m^2.$$

Both $a_1 a_1'$ and $J_1$ have only one nonzero eigenvalue and

$$\mathrm{tr}(a_1 a_1') = \mathrm{tr}(a_1' a_1) = \sum_i a_{1i}^2 \ge \Bigl(\sum_i a_{1i}\Bigr)^2 / m = 1/m = \mathrm{tr}(J_1 / m^2).$$

Similarly, $b_1 b_1' \ge J_2 / n^2$.

(ii) $G_1 \ge (1/B_1) H_1$. Let $d_i = (x_{1i} - x_{1,i-1})^{-1}$, $i = 2, 3, \ldots, m$. Then, for any
vector $s = (s_1, s_2, \ldots, s_m)'$,

$$s' G_1 s = d_2 (s_2 - s_1)^2 + d_3 (s_3 - s_2)^2 + \cdots + d_m (s_m - s_{m-1})^2$$
$$\ge [(s_2 - s_1)^2 + (s_3 - s_2)^2 + \cdots + (s_m - s_{m-1})^2] / B_1 = s' H_1 s / B_1.$$

Similarly, $G_2 \ge (1/B_2) H_2$. Hence the lemma follows. $\Box$
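
The eigenvalue formula used for $H_1$ in the proof of Theorem 6.1 can be checked directly. The sketch below is illustrative only: it builds the tri-diagonal matrix with diagonal $(1, 2, \ldots, 2, 1)$ and off-diagonal entries $-1$, and verifies that $2(1 - \cos(\pi r/m))$ is an eigenvalue for each $r$. The cosine eigenvectors are a standard fact about this matrix that is assumed here, not something stated in the text, and the function names are hypothetical.

```python
import math


def h_matrix(m):
    """Tri-diagonal matrix H: diagonal (1, 2, ..., 2, 1), off-diagonals -1."""
    H = [[0.0] * m for _ in range(m)]
    for i in range(m):
        H[i][i] = 1.0 if i in (0, m - 1) else 2.0
        if i > 0:
            H[i][i - 1] = H[i - 1][i] = -1.0
    return H


def eigenpair(m, r):
    """Claimed eigenvalue 2(1 - cos(pi r / m)) with a cosine eigenvector (assumed)."""
    lam = 2.0 * (1.0 - math.cos(math.pi * r / m))
    vec = [math.cos(math.pi * r * (2 * i + 1) / (2 * m)) for i in range(m)]
    return lam, vec


def residual(m, r):
    """Largest entry of |H v - lambda v| for the r-th claimed eigenpair."""
    H = h_matrix(m)
    lam, vec = eigenpair(m, r)
    Hv = [sum(H[i][j] * vec[j] for j in range(m)) for i in range(m)]
    return max(abs(Hv[i] - lam * vec[i]) for i in range(m))


m = 7
worst = max(residual(m, r) for r in range(m))
eigenvalue_sum = sum(eigenpair(m, r)[0] for r in range(m))
```

The eigenvalue for $r = 0$ is zero with a constant eigenvector, which is also the eigenvector carrying the single nonzero eigenvalue of $J_1$; this is what makes the bound on $\mathrm{tr}(A^2)$ split into the three sums.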


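COROLLARY 6.3. If $B_1 = O(1/m)$ and $B_2 = O(1/n)$, then for

$$v_1 = R_1 N^{-1/3}, \qquad v_2 = R_2 N^{-1/3} \qquad \text{and} \qquad v_{12} = R_{12} N^{-1/3} (\log N)^{-2/3},$$

where $R_1$, $R_2$, and $R_{12}$ are constants, we have

$$ER(\hat F, F) \le O[N^{-2/3} (\log N)^{2/3}].$$

The above bound is much tighter than that obtained in Section 5.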


7. Wahba’s technique. Wahba (1979) has considered the use of “thin plate”
splines to smooth surfaces in higher dimensions. In two dimensions the simplest
form of such a spline involves choosing $\hat F$ to minimise

(7.1) $\sum_{i=1}^{n} (y_i - F(x_{1i}, x_{2i}))^2 + c \iint (F_{11}^2 + 2F_{12}^2 + F_{22}^2)$,

where $F_{ij} = \partial^2 F / \partial x_i\, \partial x_j$.
Wahba (1979) states that the solution $\hat F$ has a representation

$$\hat F(x) = d_0 + d_1 x_1 + d_2 x_2 + \sum_{i=1}^{n} b_i E(x, x_i),$$

where

$$E(x, y) = |x - y|^2 \log|x - y|$$

with

$$|x - y|^2 = (x_1 - y_1)^2 + (x_2 - y_2)^2.$$

It is shown in Wahba and Wendelberger (1980) that the Bayesian procedure
corresponding to this technique is to assume that

$$y_i = F(x_i) + e_i, \quad i = 1, 2, \ldots, n$$

with $\{e_i\}$ iid $N(0, v)$ and to put a prior on $F$ which is the same as the stochastic
process

$$a_0 + a_1 x_1 + a_2 x_2 + v_1^{1/2} Z(x),$$

where

$$a = (a_0, a_1, a_2)' \sim N(0, \xi I), \quad \xi \to \infty,$$
$$v_1 = v/c,$$

and $Z(x)$ is a mean zero Gaussian process with covariance function

$$Q(x, y) = E(x, y) - \sum_{j=1}^{3} P_j(x) E(S_j, y) - \sum_{j=1}^{3} P_j(y) E(S_j, x) + \sum_{i=1}^{3} \sum_{j=1}^{3} P_i(x) P_j(y) E(S_i, S_j),$$

where $S_1, S_2, S_3$ are chosen so that

$$\sum_{j=1}^{3} a_j \phi_k(S_j) = 0 \ \text{ for } k = 1, 2, 3 \ \Rightarrow\ a_j = 0 \ \text{ for } j = 1, 2, 3$$

for any basis $\{\phi_1, \phi_2, \phi_3\}$ of the space of polynomials of total degree 1 or 0 and $P_i$
is such that $P_i(S_j) = \delta_{ij}$.

Two criticisms of this technique can be made. First, only one smoothing
parameter is used, suggesting that the function to be estimated is equally smooth
in all directions. This is rarely true. Second, the equivalent Bayesian formulation
involves the use of a complicated covariance function seemingly unrelated to that
used in 1-d. Wahba (1981) suggests a possible answer to the first criticism. In 2-d
her suggestion corresponds to a roughness penalty of the form

(7.2) $c \iint (F_{11}^2 + 2\theta F_{12}^2 + \theta^2 F_{22}^2)$.

The technique of Section 2 was originally proposed to address the second criticism but
has resulted in a technique which also overcomes (at least theoretically) the first
criticism.

The roughness penalty (7.1) is invariant under rotation of the $x_1$ and $x_2$ axes;
the penalty (7.2) is not. Likewise the Bayesian technique of this paper is not
invariant under rotation of the axes.

The roughness penalty (7.1) is the closest to the Bayesian technique in the
sense that the infinitely smoothed estimate is the least-squares plane while for
the Bayesian technique the infinitely smoothed estimate is a constant. Wahba
(1979) considers roughness penalties involving higher derivatives leading to
higher-order “thin plate” splines. The roughness penalty involving derivatives of
order 3 is

(7.3) $c \iint (F_{111}^2 + F_{222}^2 + 3F_{112}^2 + 3F_{122}^2)$,

where

$$F_{ijk} = \frac{\partial^3 F}{\partial x_i\, \partial x_j\, \partial x_k}.$$

Barry (1983) describes models for incorporating stronger smoothness assumptions
in the Bayesian framework of this paper.


8. Simulation study. A simulation study was carried out to compare
Wahba’s techniques using roughness penalties (7.1) and (7.3) with the Bayesian
technique of Section 2. For each of four underlying regression functions data were
generated on the grid

$$\{(x_{1i}, x_{2j}) : 1 \le i \le 10,\ 1 \le j \le 10\},$$

where

$$x_{1i} = (2i - 1)/20, \qquad x_{2j} = (2j - 1)/20,$$

by setting

$$y_{ij} = F(x_{1i}, x_{2j}) + e_{ij},$$

where $\{e_{ij}\}$ are iid $N(0, v)$. The four underlying regression functions used were:

$F_1$: $6144 (xy)^6 (1 - xy)^6$;
$F_2$: $1.5 \sin(12x) \sin(12y)$;
$F_3$: $(L(x) + 3L(y) + L(x)L(y))/8$,

where

$L(x) = 8x$, $0 \le x < 0.25$,
$\phantom{L(x)} = 2 - 8(x - 0.25)$, $0.25 \le x < 0.5$,
$\phantom{L(x)} = 8(x - 0.5)$, $0.5 \le x < 0.75$,
$\phantom{L(x)} = 1.5 - 8(x - 0.75)$, $0.75 \le x \le 1.0$;

$F_4$: $1.5 Z_1(x) Z_2(y)$, where $Z_1$ and $Z_2$ are independent Brownian motions.

$F_1$, $F_2$ and $F_3$ each have maximum value 1.5. $F_1$ is slowly changing and infinitely
differentiable; $F_2$ changes quickly, but is also infinitely differentiable; $F_3$ is
continuous, but only piecewise differentiable, while $F_4$ is a sample path from a
stochastic process which has continuous, nowhere differentiable sample paths.
Three values for $v$ were used: $v = 0.01$, 0.0625, and 0.25 (corresponding to
standard deviations of 0.1, 0.25, and 0.5, respectively).
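
The data-generating step of the simulation can be sketched in a few lines of Python. This is a hedged illustration, not the code used for the study: the exponents in $F_1$ are read as 6 (the value for which the stated maximum of 1.5 is attained at $xy = \tfrac12$), the last branch of $L$ uses the intercept 1.5 exactly as printed, $F_4$ is omitted, and all names and the seed are hypothetical.

```python
import math
import random


def F1(x, y):
    # Exponents assumed to be 6, which gives the stated maximum 1.5 at xy = 1/2.
    u = x * y
    return 6144.0 * u ** 6 * (1.0 - u) ** 6


def F2(x, y):
    return 1.5 * math.sin(12.0 * x) * math.sin(12.0 * y)


def L(x):
    if x < 0.25:
        return 8.0 * x
    if x < 0.5:
        return 2.0 - 8.0 * (x - 0.25)
    if x < 0.75:
        return 8.0 * (x - 0.5)
    return 1.5 - 8.0 * (x - 0.75)  # intercept 1.5 as printed in the text


def F3(x, y):
    return (L(x) + 3.0 * L(y) + L(x) * L(y)) / 8.0


def simulate(F, v, rng):
    """Return (x1, x2, y) triples on the 10 x 10 grid with N(0, v) errors."""
    sd = math.sqrt(v)
    grid = [((2 * i - 1) / 20, (2 * j - 1) / 20)
            for i in range(1, 11) for j in range(1, 11)]
    return [(x1, x2, F(x1, x2) + rng.gauss(0.0, sd)) for x1, x2 in grid]


rng = random.Random(0)  # arbitrary seed, for reproducibility only
data = simulate(F3, 0.0625, rng)
```

Each call to `simulate` produces one of the 50 repetitions for a given function and error variance; the smoothing itself is not sketched here.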

For each combination of regression function and error variance $v$, 50 repetitions
were carried out and the average mean squared error obtained using the
three techniques is recorded in Table 1. In all cases the smoothing parameters
were chosen by generalised cross validation as described in Craven and Wahba
(1979).

TABLE 1
Average mean squared residual using (i) Bayesian technique,
(ii) Wahba [roughness penalty (7.1)], and (iii) Wahba [roughness penalty (7.3)]

function   technique   v = 0.01   v = 0.0625   v = 0.25
F_1        (i)         0.00654    0.0273       0.0725
           (ii)        0.00575    0.0195       0.0526
           (iii)       0.00390    0.0175       0.0553
F_2        (i)         0.00886    0.0541       0.1629
           (ii)        0.00983    0.0627       0.1791
           (iii)       0.00914    0.0429       0.1252
F_3        (i)         0.00336    0.0159       0.0495
           (ii)        0.00775    0.0239       0.0559
           (iii)       0.00605    0.0230       0.0568
F_4        (i)         0.01034    0.0626       0.2140
           (ii)        0.01034    0.0848       0.2284
           (iii)       0.01034    0.0775       0.2865

The comparison is not clear-cut. A general observation would be that the
Bayesian technique is best for rougher functions. For the smoothest function $F_1$
Wahba’s techniques do far better than the Bayesian technique. However, as the
roughness of the underlying regression function increases the Bayesian technique
becomes more competitive and does best for $F_3$ and $F_4$.

The comparison is a little clouded for $v = 0.01$. Here the Bayesian technique
does best for $F_2$ and all techniques are equivalent for $F_4$; in fact they all opt to
do no smoothing at all.

The design of a simulation study to compare different smoothing techniques is 
difficult since the choice of test functions seems crucial. The above study, using 
four quite different functions, suggests that the Bayesian technique works well 
when the underlying regression function has limited smoothness properties. 
However, the decision as to which technique to use in a particular situation seems 
very difficult and needs further study. 


9. Regression in higher dimensions. The two-dimensional technique was 
based on a decomposition of the regression function F into four parts analogous 
to a decomposition widely used in the parameterisation of two-way analysis of 
variance: overall mean, row effects, column effects, and interaction terms. Con- 
tinuing the analogy into higher dimensions leads in a straightforward manner to 
the appropriate generalisation of the two-dimensional case. 

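In three dimensions, for example, we use the decomposition

$$F(x_1, x_2, x_3) = \mu + \alpha_1(x_1) + \alpha_2(x_2) + \alpha_3(x_3) + \alpha_{12}(x_1, x_2) + \alpha_{23}(x_2, x_3) + \alpha_{13}(x_1, x_3) + \alpha_{123}(x_1, x_2, x_3),$$

where

$$\mu = \iiint F(u, v, w)\,du\,dv\,dw,$$

$$\alpha_1(x_1) = \iint F(x_1, v, w)\,dv\,dw - \mu$$

[$\alpha_2$, $\alpha_3$ similarly],

$$\alpha_{12}(x_1, x_2) = \int F(x_1, x_2, w)\,dw - \iint F(x_1, v, w)\,dv\,dw - \iint F(u, x_2, w)\,du\,dw + \iiint F(u, v, w)\,du\,dv\,dw$$

[$\alpha_{23}$, $\alpha_{13}$ similarly], and

$$\alpha_{123}(x_1, x_2, x_3) = F(x_1, x_2, x_3) - \int F(x_1, x_2, w)\,dw - \int F(u, x_2, x_3)\,du - \int F(x_1, v, x_3)\,dv$$
$$\qquad + \iint F(x_1, v, w)\,dv\,dw + \iint F(u, x_2, w)\,du\,dw + \iint F(u, v, x_3)\,du\,dv - \iiint F(u, v, w)\,du\,dv\,dw,$$

where all integrals are from 0 to 1. Appropriately adjusted Brownian sheet priors
can be placed on each term and Bayes theorem applied to get $\hat F$.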

The extension to higher dimensions is described in detail in Barry (1983) where 
consistency results are also demonstrated. As the number of dimensions increases 
the number of prior parameters increases and it may be necessary to include 
higher-order interaction terms in the error term as is often done in multi-way 
ANOVA. 
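
The three-dimensional decomposition has an exact discrete analogue in which each integral over $[0, 1]$ is replaced by an average over grid values. The Python sketch below illustrates that analogue only; it is not the estimator itself, the nested-list layout `F[i][j][k]` and all names are assumptions, and no prior is involved. It confirms that the eight terms add back up to $F$.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def anova_decomposition(F):
    """Split F[i][j][k] into overall mean, main effects and interactions."""
    I, J, K = range(len(F)), range(len(F[0])), range(len(F[0][0]))
    mu = mean(F[i][j][k] for i in I for j in J for k in K)
    # Averages over two coordinates (functions of one variable).
    m1 = [mean(F[i][j][k] for j in J for k in K) for i in I]
    m2 = [mean(F[i][j][k] for i in I for k in K) for j in J]
    m3 = [mean(F[i][j][k] for i in I for j in J) for k in K]
    # Averages over one coordinate (functions of two variables).
    m12 = [[mean(F[i][j][k] for k in K) for j in J] for i in I]
    m13 = [[mean(F[i][j][k] for j in J) for k in K] for i in I]
    m23 = [[mean(F[i][j][k] for i in I) for k in K] for j in J]
    a1 = [m1[i] - mu for i in I]
    a2 = [m2[j] - mu for j in J]
    a3 = [m3[k] - mu for k in K]
    a12 = [[m12[i][j] - m1[i] - m2[j] + mu for j in J] for i in I]
    a13 = [[m13[i][k] - m1[i] - m3[k] + mu for k in K] for i in I]
    a23 = [[m23[j][k] - m2[j] - m3[k] + mu for k in K] for j in J]
    a123 = [[[F[i][j][k] - m12[i][j] - m13[i][k] - m23[j][k]
              + m1[i] + m2[j] + m3[k] - mu
              for k in K] for j in J] for i in I]
    return mu, (a1, a2, a3), (a12, a13, a23), a123


def recompose(parts, i, j, k):
    """Add the eight terms back together at grid point (i, j, k)."""
    mu, (a1, a2, a3), (a12, a13, a23), a123 = parts
    return (mu + a1[i] + a2[j] + a3[k]
            + a12[i][j] + a13[i][k] + a23[j][k] + a123[i][j][k])


example = [[[math.sin(i) + j * k + 0.5 * i * j * k for k in range(3)]
            for j in range(4)] for i in range(2)]
parts = anova_decomposition(example)
```

As in multi-way ANOVA, each main effect and interaction averages to zero over any one of its arguments, which is what makes the decomposition unique.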


Acknowledgments. This research was carried out at Yale University as 
part of the requirements for a Ph.D. degree in statistics. I would like to thank 
Professor John Hartigan for his help and encouragement during the course of this 
research. Thanks are also due to Professor Grace Wahba for sending me copies of 
her work in this area. 


REFERENCES 


BARRY, D. (1983). Nonparametric Bayesian regression. Ph.D. thesis, Yale Univ.

BELLMAN, R. (1970). Introduction to Matrix Analysis. McGraw-Hill, New York.

CRAVEN, P. and WAHBA, G. (1979). Smoothing noisy data with spline functions: estimating the
correct degree of smoothing by the method of generalized cross validation. Numer. Math.
31 377-403.

KIMELDORF, G. and WAHBA, G. (1971). Some results on Tchebycheffian spline functions. J. Math.
Anal. Appl. 33 82-95.

SCHOENBERG, I. J. (1964). Spline functions and the problem of graduation. Proc. Nat. Acad. Sci.
U.S.A. 52 947-960.

SHEPP, L. (1966). Radon-Nikodym derivatives of Gaussian measures. Ann. Math. Statist. 37
321-364.

WAHBA, G. (1978). Improper priors, spline smoothing and the problem of guarding against model
errors in regression. J. Roy. Statist. Soc. Ser. B 40 364-372.

WAHBA, G. (1979). Convergence rates of “thin plate” smoothing splines when the data are noisy.
Technical Report 557, Dept. Statistics, Univ. of Wisconsin, Madison.

WAHBA, G. (1981). Data-based optimal smoothing of orthogonal series density estimates. Ann.
Statist. 9 146-156.

WAHBA, G. and WENDELBERGER, J. (1980). Some new mathematical methods for variational
objective analysis using splines and cross validation. Monthly Weather Review 108
1122-1143.

ZIMMERMAN, G. J. (1972). Some sample-function properties of the two-parameter Gaussian process.
Ann. Math. Statist. 43 1235-1246.


DEPARTMENT OF STATISTICS 
UNIVERSITY COLLEGE 

CORK 

IRELAND 


The Annals of Statistics 
1986, Vol. 14, No. 3, 954-970 


BAYES RULES FOR A CLINICAL-TRIALS MODEL WITH 
DICHOTOMOUS RESPONSES¹


By GORDON SIMONS 


University of North Carolina 


The risk in a trial to compare two medical treatments is borne by the 
patients who receive the inferior treatment during the experimental phase 
and by those remaining after the experiment who will all receive the inferior 
treatment if the results are misleading. The Bayes rule indicates, for the 
observed progression of successes and failures, when it is optimal to stop this 
experimental phase. This stopping rule can be described exactly, or nearly so, 
for symmetric two-point priors. Less precise descriptions are possible for other 
types of priors. An admissible stopping rule is described which is best possible,
among symmetric Bayes rules, in that it minimizes the probability of choosing 
the inferior treatment no matter what the values are for the probabilities of 
success. 


1. Introduction. Much of the literature on sequential clinical trials cir- 
cumvents the difficulties of working with Bernoulli-type responses, “successes” 
and “failures,” by assuming that the treatment responses are normally distrib- 
uted. While such an assumption has certain advantages, it has significant disad- 
vantages as well, including some technical disadvantages: 


(i) The more delicate results for normally distributed treatment responses 
apparently are not directly obtainable [cf. Chernoff and Petkau (1981, 1985)]. 
One first studies a continuous-time free boundary problem associated with the 
heat equation and obtains suitable approximate solutions. Then suitable adjust- 
ments, additional approximations, are required to return to the discrete-time 
setting. This is a difficult agenda. The technical details can seem quite forbidding 
except, perhaps, to a few experts. 

(ii) It is not easy to extrapolate from the normal results comparable results for 
the Bernoulli setting, particularly for small and moderate sample sizes. 


The intent of the present paper is to derive a variety of results directly within 
the setting of Bernoulli-type responses. These, of course, have the advantage of 
direct applicability. Another advantage of working with Bernoulli-type responses 
is that the quality of approximations can easily be assessed by direct numerical 
calculations, without having to resort to more costly and less accurate simulation 
studies. 

The model is as follows. Two contending treatments are to be assigned at
random to pairs of patients. With the total number of patients, the “horizon,”
prespecified, this sampling by pairs is to continue until there is sufficient
information about the relative quality of the two treatments that it is prudent to
assign the more promising treatment to all of the remaining patients. This
formulation was proposed by Anscombe (1963) for normally distributed treatment
responses. The statistician’s task is to find a suitable stopping rule which
indicates when sampling by pairs, the “testing phase,” should be stopped. The
risk (function) used here is the “expected successes lost” (ESL) by the stopping
rule because it is unknown which of the success probabilities $p_1$ and $p_2$ is larger.
It is mathematically equal to the product of $|p_1 - p_2|$ and the expected number
of patients assigned the inferior treatment.

Received August 1984; revised October 1985.

¹ This research was supported by National Science Foundation Grant DMS-8400602.
AMS 1980 subject classifications. Primary 60G40, 62C10, 62L05; secondary 62C15.

Key words and phrases. Clinical trials, Bayes rules, optimal stopping, Markovian states.

For any prior distribution on $(p_1, p_2)$, the posterior Bayes risk depends on the
size of the horizon N and on the Markovian state (n, r, s), where n is the current 
number of sampled pairs, and where r and s are the current numbers of successes 
for the first and second treatments, respectively. If the prior distribution is 
supported by two symmetric points (a, b) and (b, a), then the optimal stopping 
rule depends on N and (n,r,s) only through the values of t= N — 2n, the 
number of patients remaining (the “time to go”), and r — s, the current “success 
difference.” 

There are several reasons for being interested in these two-point priors. It was
demonstrated by Bather and Simons (1985) that the minimax stopping rule, for
most of the first 200 values of $N$, is the Bayes stopping rule for a “least
favorable” symmetric prior on two such points with $a + b = 1$ ($a$ and $b$ depending
on $N$). Another reason is that for any symmetric stopping rule, the risk at
$(p_1, p_2) = (a, b)$ is equal to the Bayes risk for a symmetric prior on the two
points $(a, b)$ and $(b, a)$. (A symmetric stopping rule is one which is indifferent to
the ordering of the treatments.) Thus the risk at a particular point $(p_1, p_2) =
(a, b)$ can be minimized among all symmetric stopping rules by finding the Bayes
stopping rule for the symmetric prior supported on $(a, b)$ and $(b, a)$. The value
of this minimum risk is of some interest when one has little or no reason to think
that one particular treatment is better than the other. It is a reasonable standard
by which to evaluate the risk for any symmetric stopping rule. This kind of
perspective has been pursued in another paper [Simons (1986)].

The Bayes stopping rules for two-point symmetric priors are discussed in
Section 2. Inner and outer approximations to the stopping boundary are obtained
(Theorems 2 and 4). These are good enough to provide an asymptotic description
of the stopping boundary as $t \to \infty$. The inner approximation is exceptionally
good for small as well as large values of $t$. In some cases, it quite reliably specifies
the exact values of $t$ at which the optimal boundary expands. Even when it is not
this accurate, it suggests a barely suboptimal stopping rule. Approximations to
the minimal Bayes risk are obtained (Theorem 3) which are good for most values
of $t$.

It turns out that the optimal stopping boundaries for two-point symmetric
priors have an outer envelope. This envelope is approached by letting $(a, b)$ go to
$(\tfrac12, \tfrac12)$. There are several reasons for being interested in this envelope. Firstly, it is
the optimal stopping boundary for an easily described random walk $S_n$ when the
stopping reward takes the form $t|S_n|$, where $t = N - 2n$ is the time to go. It is
not difficult to find this optimal stopping boundary; the corresponding stopping
rule can be described as:

(1) Continue when $|S_n| = k$ if $t \ge T_k$, $k \ge 0$,

where $T_0, T_1, T_2, \ldots$ is an increasing sequence of positive integers. The first twelve
values are 2, 14, 41, 82, 136, 204, 285, 381, 490, 613, 749, 900.² Secondly, the outer
envelope is an outer envelope in a much stronger sense: For any symmetric prior
distribution $G$ for $(p_1, p_2)$ (i.e., any prior which is exchangeable in $p_1$ and $p_2$)
and any horizon $N$, the Markovian state $(n, r, s)$ is a point of optimal stopping if
$t < T_k$, where $t = N - 2n$ and $k = |r - s|$. It might be difficult to give a
complete description of the optimal stopping points $(n, r, s)$ for $G$, but the
sequence $T_0, T_1, \ldots$ can be used quite simply to establish large numbers of triplets
$(n, r, s)$ as points of optimal stopping. Thirdly, one is led to consider the simple
symmetric “envelope” stopping rule:

(2) Continue when $|r - s| = k$ if $t = (N - 2n) \ge T_k$, $k \ge 0$.

This rule is a Bayes rule (for a symmetric two-point prior depending on $N$), and
it is admissible. Among Bayes symmetric stopping rules, it uniformly minimizes
the error probability (the probability of choosing the inferior treatment) at all
values of $(p_1, p_2)$, providing the value $T_0 = 2$ is used instead of 3. (See footnote
2.) These ideas and assertions are discussed further in Section 3.

There are practical reasons why one might wish to work with other than
symmetric prior distributions. A physician³ may begin with a preference for one
of the two treatments. In such a case he should choose a nonsymmetric prior.
And even if he has no initial preference, he will probably develop a preference 
during the course of the clinical trial, i.e., his symmetric prior will probably 
become a nonsymmetric posterior eventually. One should be able to begin anew 
with this nonsymmetric posterior, viewing it as an updated prior. If one is to 
continue optimally from this point, one must be able to work with certain types 
of nonsymmetric priors. 

On the other hand, there is no end to possible priors. Even the staunchest
Bayesian must recognize the futility of asking a physician to specify a prior
which accurately reflects his true beliefs in complete detail. With this in mind,
the author suggests to those who are Bayesians (and perhaps to those who are
not) that a simple compromise is possibly in order. It is suggested that the
physician be asked to express his beliefs by specifying a single real, possibly
integer-valued, parameter $\theta$, one whose interpretation is easily communicated in
a layman’s language: It is positive when the first treatment is considered
preferable, and negative when the second is preferred. Its magnitude is to denote
the number of failures of the physician’s preferred treatment, together with an
equal number of successes of the other treatment, which jointly would cause the
physician to come to view the two treatments as equally promising. For the
Bayesian, the posterior effect on $\theta$ from assigning the two treatments at random
to a pair of patients is to increase $\theta$ by one if the first treatment is successful and
the second is not, to decrease $\theta$ by one in the opposite circumstances, and to keep
it the same if both treatments yield the same results. For this interpretation to
make sense, one must begin with a “symmetrizable prior,” a prior which permits
(at least when $\theta$ is integer-valued) a symmetric posterior. If the physician, after
due reflection, should choose a value of $\theta$ very far from zero, it seems likely that
frequentists, like Bayesians, would be uncomfortable with a symmetric stopping
rule. If one has a symmetric stopping rule which one is pleased with to handle
situations of no initial treatment preference ($\theta = 0$), it easily can be modified to
accommodate situations of an initial treatment preference ($\theta \ne 0$). The subject of
symmetrizable priors is discussed in Section 4.

² The first of these is a matter of some indifference. The value $T_0 = 3$ is optimal as well.
³ The word “physician” should be understood as referring to those whose medical judgments enter
into the design of the clinical trial.

Our theory extends to encompass “ethical costs,” which recently have been 
considered by Chernoff and Petkau (1985). An ethical cost occurs whenever it 
appears that a patient is being given the inferior treatment. This extension is 
discussed briefly in Section 5. 


2. The Bayes stopping rule for two-point symmetric priors. Let $G$ be
any prior distribution for $(p_1, p_2)$ and let $E_n$ denote conditional expectation
based on the results from assigning the two competing treatments at random to $n$
pairs of patients. The conditional expected successes lost for these $2n$ patients is
$nE_n|p_1 - p_2|$. Suppose $N$ is the total number of patients. If all of the remaining
$N - 2n$ patients are given the more promising treatment, their conditional
expected successes lost is $(N - 2n)E_n(p_2 - p_1)^+$ or $(N - 2n)E_n(p_1 - p_2)^+$,
whichever is smaller, depending on which treatment is the more promising. Thus
the posterior Bayes risk at “time” $n$ is

$$nE_n|p_1 - p_2| + (N - 2n)\min\bigl(E_n(p_2 - p_1)^+,\ E_n(p_1 - p_2)^+\bigr).$$

Conveniently, this can be rewritten as

(3) $\dfrac{N}{2} E_n|p_1 - p_2| - \Bigl(\dfrac{N}{2} - n\Bigr)|E_n(p_1 - p_2)|$.

The first term is a martingale in $n$ and, hence, has no bearing on the question of
optimal stopping. Consequently, the problem of optimal stopping can be recast in
terms of a reward sequence defined by

(4) $R_n = (N - 2n)|E_n(p_1 - p_2)|$.

It can be easily checked that this problem is Markovian with a state parameter
$(n, r, s)$, where $r$ and $s$ are the numbers of successes produced by the first and
second treatments, respectively, among the first $n$ pairs of patients. For the prior
$G$, one can identify certain states $(n, r, s)$ as optimal stopping states (or points)
and the remainder as optimal continuation states (or points). As a matter of
convenience, states $(n, r, s)$ for which stopping and continuation are both optimal
will be assigned both appellations.

Now suppose that the prior $G$ assigns the probability $\tfrac12$ to each of two
symmetric points $(a, b)$ and $(b, a)$ with $a > b$. Then (4) becomes

(5) $R_n = (a - b)(N - 2n)\tanh(|r - s|\alpha)$,

where

(6) $\alpha = \tfrac12 \log\bigl(a(1 - b)/(1 - a)b\bigr)$.

The Markovian state $(n, r, s)$ can be replaced by the simpler Markovian state

(7) $(t, k) = (N - 2n,\ r - s)$.

Then $(a - b)^{-1} R_n$ becomes

(8) $R(t, k) = t \tanh(|k|\alpha)$.
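
The passage from (4) to (5) can be checked numerically. The sketch below is a hedged illustration with hypothetical function names: it computes the posterior mean of $p_1 - p_2$ for the two-point prior directly from Bayes' theorem (the binomial coefficients cancel and are omitted) and compares it with the closed form $(a - b)\tanh((r - s)\alpha)$, reading (6) as $\alpha = \tfrac12\log(a(1-b)/((1-a)b))$.

```python
import math


def posterior_mean_difference(a, b, n, r, s):
    """E_n(p1 - p2) under prior mass 1/2 on (a, b) and 1/2 on (b, a)."""
    like_ab = a ** r * (1 - a) ** (n - r) * b ** s * (1 - b) ** (n - s)
    like_ba = b ** r * (1 - b) ** (n - r) * a ** s * (1 - a) ** (n - s)
    post_ab = like_ab / (like_ab + like_ba)
    return (a - b) * post_ab + (b - a) * (1.0 - post_ab)


def tanh_form(a, b, r, s):
    """Closed form (a - b) tanh((r - s) alpha), with alpha as read from (6)."""
    alpha = 0.5 * math.log(a * (1 - b) / ((1 - a) * b))
    return (a - b) * math.tanh((r - s) * alpha)


def reward(a, b, N, n, r, s):
    """Reward (4): (N - 2n) |E_n(p1 - p2)|."""
    return (N - 2 * n) * abs(posterior_mean_difference(a, b, n, r, s))


cases = [(0.7, 0.4, 5, 4, 1), (0.6, 0.5, 12, 3, 9), (0.75, 0.25, 8, 4, 4)]
gaps = [abs(posterior_mean_difference(a, b, n, r, s) - tanh_form(a, b, r, s))
        for a, b, n, r, s in cases]
```

The posterior depends on the data only through $r - s$, which is why the state can be reduced to $(t, k)$.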


The maximum expected reward using an optimal stopping time for the initial
state $(t, k)$ can be written as $(a - b)S(t, k)$, where $S(t, k)$ is defined recursively
by

(9) $S(t, k) = \max\{R(t, k),\ u_k S(t - 2, k - 1) + v S(t - 2, k) + w_k S(t - 2, k + 1)\}$, $t \ge 2$,

with the initial values $S(t, k) = R(t, k)$ for $t = 0, 1$, where

(10) $u_k = \beta \dfrac{\cosh (k - 1)\alpha}{\cosh k\alpha}$, $\quad v = ab + (1 - a)(1 - b)$, $\quad w_k = \beta \dfrac{\cosh (k + 1)\alpha}{\cosh k\alpha}$, $\quad \beta = \{ab(1 - a)(1 - b)\}^{1/2}$.

Observe that $\beta e^{\alpha} = a(1 - b)$ and $\beta e^{-\alpha} = b(1 - a)$, so that $2\beta \sinh \alpha = a - b$.

The point $(t, k)$ is an optimal stopping point if $S(t, k) = R(t, k)$. It is
an optimal continuation point if $S(t, k) > R(t, k)$ or if $u_k S(t - 2, k - 1) +
v S(t - 2, k) + w_k S(t - 2, k + 1) = R(t, k)$. In the latter case, $(t, k)$ is both an
optimal continuation point and an optimal stopping point (according to a
previously announced convention).
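
Recursion (9) is straightforward to evaluate by backward induction. The following Python sketch is one possible implementation, given as an illustration only; it assumes the readings $u_k = \beta\cosh((k-1)\alpha)/\cosh(k\alpha)$, $w_k = \beta\cosh((k+1)\alpha)/\cosh(k\alpha)$ and $\beta = \{ab(1-a)(1-b)\}^{1/2}$ for the transition weights, and the function names are hypothetical. Memoisation keeps the cost proportional to the number of reachable states.

```python
import math
from functools import lru_cache


def make_problem(a, b):
    """Return (reward, value) for the symmetric two-point prior on (a, b), (b, a)."""
    alpha = 0.5 * math.log(a * (1 - b) / ((1 - a) * b))
    beta = math.sqrt(a * b * (1 - a) * (1 - b))
    v = a * b + (1 - a) * (1 - b)

    def reward(t, k):
        # R(t, k) = t tanh(|k| alpha), equation (8).
        return t * math.tanh(abs(k) * alpha)

    @lru_cache(maxsize=None)
    def value(t, k):
        # S(t, k) from recursion (9); no further pair can be sampled once t < 2.
        if t < 2:
            return reward(t, k)
        u_k = beta * math.cosh((k - 1) * alpha) / math.cosh(k * alpha)
        w_k = beta * math.cosh((k + 1) * alpha) / math.cosh(k * alpha)
        cont = (u_k * value(t - 2, k - 1) + v * value(t - 2, k)
                + w_k * value(t - 2, k + 1))
        return max(reward(t, k), cont)

    return reward, value


reward, value = make_problem(0.75, 0.25)
stop_is_optimal = value(10, 1) == reward(10, 1)  # stopping state check
```

For large horizons the recursive form would exceed Python's default recursion depth, so an iterative sweep over $t = 0, 1, 2, \ldots$ would be the natural replacement.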


THEOREM 1. Given $a, b$ with $0 < b < a < 1$, there is a strictly increasing
sequence of positive integers $\tau_0, \tau_1, \tau_2, \ldots$ for which the state $(t, k)$ is an optimal
continuation point if $t \ge \tau_{|k|}$, and an optimal stopping point if $t < \tau_{|k|}$.

PROOF. The difference $S(t, k) - R(t, k)$ satisfies recursive relationships akin
to those in equations (22) and (23) of Bather and Simons (1985). The argument
then proceeds as for their Theorem 5. $\Box$


Within the region of continuation, $S(t, k)$ satisfies the difference equation

(11) $Z(t, k) = u_k Z(t - 2, k - 1) + v Z(t - 2, k) + w_k Z(t - 2, k + 1)$.

This equation has many solutions besides the one defined in (9), including some
fairly simple solutions that can be used to obtain several useful approximations.
The general form of the symmetric separable solutions is

(12) $Z(t, k) = (v + 2\beta \cosh x)^{t/2}\, \dfrac{\cosh kx}{\cosh k\alpha}$,



where the variable $x$ is arbitrary. When $x = \alpha$, this becomes $Z(t, k) = 1$. A
multiple of the partial derivative of the right side of (12) with respect to $x$,
evaluated at $x = \alpha$, yields another solution, namely

(13) $Z(t, k) = t + \dfrac{2k}{a - b} \tanh k\alpha$.

THEOREM 2. The point $(t, k)$, $k \ge 0$, is an optimal continuation point if

(14) $t \ge 2 + \dfrac{2 \sinh k\alpha\, \sinh (k + 1)\alpha}{(a - b) \sinh \alpha} + \dfrac{2k \tanh k\alpha}{a - b}$.

PROOF. Consider a particular solution $Z(t, k)$ of (11). A point $(t, k)$ will be
called “good” if $Z(t, k) \ge R(t, k)$, and called “warm” if $Z(t, k) \le S(t, k)$. If a
point $(t, k)$ is good and each of its immediate successors $(t - 2, k - 1)$, $(t - 2, k)$,
$(t - 2, k + 1)$ is warm, then it is an optimal continuation point. For then

(15) $u_k S(t - 2, k - 1) + v S(t - 2, k) + w_k S(t - 2, k + 1)$
$\qquad \ge u_k Z(t - 2, k - 1) + v Z(t - 2, k) + w_k Z(t - 2, k + 1)$
$\qquad = Z(t, k) \ge R(t, k)$.

Clearly a point $(t, k)$ is warm if $Z(t, k) \le R(t, k)$ since $S(t, k) \ge R(t, k)$. But
other warm points can be found. In particular, $(t, k)$ is warm if each of its
immediate successors is warm. For then

$$Z(t, k) = u_k Z(t - 2, k - 1) + v Z(t - 2, k) + w_k Z(t - 2, k + 1)$$
$$\le u_k S(t - 2, k - 1) + v S(t - 2, k) + w_k S(t - 2, k + 1)$$
$$\le S(t, k).$$

To prove the theorem, the particular solution $Z(t, k)$ must be chosen with
some care. Consider a specific point $(t_0, k_0)$, $k_0 \ge 0$. For $k_0 = 0$, the solution
$Z(t, k) = 0$ can be used to establish (14), namely that $(t_0, 0)$ is an optimal
continuation point when $t_0 \ge 2$. For $k_0 > 0$, use

(16) $Z(t, k) = \tanh k_0\alpha \Bigl\{ t + \dfrac{2k}{a - b} \tanh k\alpha \Bigr\} - \dfrac{2k_0}{a - b} \tanh^2 k_0\alpha$,

which is a linear combination of (13) and $Z(t, k) = 1$. The equation $Z(t, k) =
R(t, k)$ divides the first quadrant into four regions as indicated in Figure 1.

The lattice points $(t, k)$ in the cross-hatched regions II and IV satisfy the
inequality $Z(t, k) \le R(t, k)$ and, therefore, are warm. It follows by straightforward
induction, based on $t$, that all of the points in region I are warm as well. All
of the points in regions I and III are good, but only some of them are optimal
continuation points. To be optimal continuation points, it is enough that their
immediate successors are warm. Thus every lattice point $(t, k)$ in region I is an
optimal continuation point except, possibly, those (upper) boundary points of the
form $(t, k_0)$ for which $(t - 2, k_0 + 1)$ is in region III. But if $(t_0, k_0)$ satisfies (14),
the point $(t_0 - 2, k_0 + 1)$ has to be in region II. So $(t_0, k_0)$ is an optimal
continuation point. $\Box$


FIG. 1.

The values of $t_1$ and $t_2$, appearing in Figure 1, are

$$t_1 = \frac{2k_0}{a - b} \tanh k_0\alpha, \qquad t_2 = \frac{2k_0}{a - b} \tanh k_0\alpha + \frac{2}{(a - b)\alpha} \sinh^2 k_0\alpha.$$


Not only do $Z$ and $R$ agree at $(t_2, k_0)$, but so do their first partial derivatives.
Thus one should expect $Z$ to closely approximate $S$ in the vicinity of $(t_2, k_0)$ [cf.
Chernoff (1972), page 93, and Bather (1983)]. This is the case. For instance, when
$a = 0.75$ and $b = 0.25$, the first several values of the right side of (14) are 2, 22.98,
189.56, and 1651.72, which exactly predict the integer-valued transition points
referred to in Theorem 1: $\tau_0 = 2$, $\tau_1 = 23$, $\tau_2 = 190$, and $\tau_3 = 1652$. The same kind
of accuracy has been observed for 7) through +, when a = 0.6 and b = 0.4, except 
T; is overestimated by one unit. The quality of the approximation is less when a 
and b are close together. For instance, when a = 0.6 and b = 0.5, the values of 
Tos Tis Tos Tio and 7,, are exactly predicted But for 7, through 7, the predictions 
are too large by the amounts, 1, 2, 4, 5, 4, 2, and 1, respectively. This suggests 
that (14) is a good approximation for small and for large values of k. 
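
The numerical values quoted for a = 0.75 and b = 0.25 can be checked with a few lines of Python. This is only a sketch: the names `alpha` and `rhs_14` are illustrative, and the formula used for $\alpha$ is an assumption about (6), which lies outside this section. It is adopted here because it reproduces the four quoted values.

```python
import math

def alpha(a, b):
    # Assumed form of the constant alpha defined in (6) (not shown in this section).
    return 0.5 * math.log(a * (1 - b) / (b * (1 - a)))

def rhs_14(k, a, b):
    """Right side of (14): 2 + 2 sinh(ka) sinh((k+1)a) / ((a-b) sinh a) + 2k tanh(ka) / (a-b)."""
    al = alpha(a, b)
    first = 2 * math.sinh(k * al) * math.sinh((k + 1) * al) / ((a - b) * math.sinh(al))
    second = 2 * k * math.tanh(k * al) / (a - b)
    return 2 + first + second

# Compare with the values 2, 22.98, 189.56, 1651.72 quoted in the text.
for k in range(4):
    value = rhs_14(k, 0.75, 0.25)
    print(k, round(value, 2), math.ceil(value))
```

Rounding each value up to the next integer gives the predicted transition point, which is how 22.98 corresponds to $\tau_1 = 23$.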

The relevant question is: How well do the predicted transition points perform
when used in place of the (harder to obtain) correct transition points? The
answer is, they perform exceedingly well. For instance, for the case a = 0.6,
b = 0.5 referred to above, the actual expected reward is within one-tenth of one
percent of the optimal expected reward for every state (t, k), t < 2,600. For
k = 0 (which is relevant for computing the Bayes risk), it is always within two
one-hundredths of one percent. (The worst values found for t are less than 200; it
seems unlikely that t-values larger than 2,500 can cause problems.)




THEOREM 3. The Bayes risk for the clinical trial is bounded above by

$$
\frac{a-b}{2}\left\{N(1-\tanh k_0\alpha) + \frac{2k_0}{a-b}\tanh^2 k_0\alpha\right\} \tag{17}
$$

for each $k_0 = 0, 1, 2, \ldots$.




PROOF. The Bayes risk takes the form $((a-b)/2)(N - S(N,0))$. (See the
discussion preceding Theorem 1.) Now, it is apparent from Figure 1 that the
point (N, 0) is “warm” in the sense described in the proof of Theorem 2. That is,
$Z(N,0) \le S(N,0)$. The desired conclusion follows immediately from (16). □


The quality of the smallest obtainable upper bound is typically quite good. For
instance, when N = 100, a = 0.6, and b = 0.5, the best choice for $k_0$ is 3, and the
upper bound is 3.17. The actual value of the Bayes risk is 3.14. For N = 2,500,
the best bound is 13.360 and the Bayes risk is 13.359.
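
The two bounds just quoted can be reproduced numerically. The sketch below reads the bound (17) as $\frac{a-b}{2}\{N(1-\tanh k_0\alpha) + \frac{2k_0}{a-b}\tanh^2 k_0\alpha\}$ and assumes $\alpha = \frac12\log\{a(1-b)/(b(1-a))\}$ for the constant defined in (6). Both are assumptions about material that is not legible or not present in this section, and the function names are illustrative.

```python
import math

def bayes_risk_bound(N, a, b, k0):
    # Assumed form of alpha from (6); the bound itself is the reading of (17) stated above.
    al = 0.5 * math.log(a * (1 - b) / (b * (1 - a)))
    th = math.tanh(k0 * al)
    return 0.5 * (a - b) * (N * (1 - th) + 2 * k0 * th ** 2 / (a - b))

def best_bound(N, a, b, k_max=200):
    """Smallest bound over k0 = 0, 1, ..., k_max, returned as (bound, k0)."""
    return min((bayes_risk_bound(N, a, b, k0), k0) for k0 in range(k_max + 1))

print(best_bound(100, 0.6, 0.5))
print(best_bound(2500, 0.6, 0.5))
```

With these assumptions the minimizing $k_0$ for N = 100 is 3, in agreement with the text; the search range `k_max` is an arbitrary cutoff.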

There is an interesting analogue to Theorem 2. 


THEOREM 4. The point (t, k) ($k \ge 0$) is an optimal stopping point if

$$
t \le 2 + \frac{2\sinh k\alpha\,\sinh(k+1)\alpha}{(a-b)\sinh\alpha}. \tag{18}
$$


PROOF. Again let Z(t, k) be a particular solution of (11). And again there is a
need to refer to “good” and “warm” points, but with new meanings. Here a
point (t, k) will be called “good” if $Z(t,k) \le R(t,k)$, and called “warm” if
$Z(t,k) \ge S(t,k)$. If a point (t, k) is good and each of its immediate successors
(t − 2, k − 1), (t − 2, k), (t − 2, k + 1) is warm, then it is an optimal stopping
point. This is proved by reversing the inequalities in (15).

Clearly, a point (t, k) is warm if $Z(t,k) \ge R(t,k)$ and S(t, k) = R(t, k). But
other warm points can be found. In particular, (t, k) is warm if $Z(t,k) \ge R(t,k)$
and each of its immediate successors is warm. For then

$$
\begin{aligned}
& u_k S(t-2,k-1) + v S(t-2,k) + w_k S(t-2,k+1) \\
&\qquad \le u_k Z(t-2,k-1) + v Z(t-2,k) + w_k Z(t-2,k+1) \\
&\qquad = Z(t,k),
\end{aligned}
$$

so that

$$
Z(t,k) \ge \max\bigl(R(t,k),\ u_k S(t-2,k-1) + v S(t-2,k) + w_k S(t-2,k+1)\bigr) = S(t,k).
$$

If $Z \ge R$ at (t, k) and at all of the successors of (t, k), then (t, k) is warm. This is
easily shown by induction.

To prove the theorem, the particular solution Z(t, k) of (11) must be picked
with care. Consider a specific point $(t_0, k_0)$, $t_0 \ge 2$, $k_0 \ge 0$. The case $k_0 = 0$ is
easily disposed of with Z(t, k) = 0. For $k_0 > 0$, use

$$
Z(t,k) = \frac{(a-b)\,t_0\tanh k_0\alpha}{(a-b)\,t_0 + 2k_0\tanh k_0\alpha}\left\{t + \frac{2k}{a-b}\tanh k\alpha\right\}. \tag{19}
$$

Since $R(t,k) = t\tanh|k|\alpha$, it follows that $Z(t_0,k_0) = R(t_0,k_0)$. Thus $(t_0,k_0)$ is
a good point. It must be shown that its immediate successors $(t_0-2, k_0-1)$,
$(t_0-2, k_0)$, $(t_0-2, k_0+1)$ are warm. It is enough to show that $Z \ge R$ at all of
the successors of $(t_0, k_0)$.

FIG. 2.

Now the equation Z(t, k) = R(t, k) divides the first quadrant into two regions
as indicated in Figure 2. In region I, $Z \ge R$. In region II, $Z \le R$, so that all of its
lattice points are good. The point $(t_0, k_0)$ is located somewhere on the boundary
of region II. If its position is like one of the points B and C, shown in Figure 2, all
of its successors will be in region I, where $Z \ge R$, and the argument will be
complete. However, if its position is like one of the points A and D, then some of
its successors can be in the interior of region II, where Z < R. This situation
must be avoided. It can be avoided by restricting the range of $t_0$. Because of
Theorem 1, it is enough to complete the proof when


$$
\frac{2\sinh k_0\alpha\,\sinh(k_0+1)\alpha}{(a-b)\sinh\alpha} < t_0 \le 2 + \frac{2\sinh k_0\alpha\,\sinh(k_0+1)\alpha}{(a-b)\sinh\alpha}. \tag{20}
$$

For if $(t_0, k_0)$ is an optimal stopping point whenever (20) holds, then it must also
be an optimal stopping point whenever just the latter inequality in (20) holds.

When (20) holds, all of the successors of $(t_0, k_0)$ are in region I. Since the proof
of this is tedious but not difficult, it will be omitted. (Several elementary
inequalities involving hyperbolic functions must be isolated and checked.) This
completes the proof. □


Together, Theorems 2 and 4 provide for the proper classification of a large
number of states (t, k). They show that the transition points $\tau_k$, referred to in
Theorem 1, grow asymptotically with k like $2\sinh k\alpha\,\sinh(k+1)\alpha/((a-b)\sinh\alpha)$.
Thus the boundary grows with t at the rate $(2\alpha)^{-1}\log t$. More precise statements
are possible.

Certain pairs (t, k) are optimal stopping states for every a and b (a > b).
Some of these “universal optimal stopping states” can be found by using
Theorem 4. It can be shown that the right side of (18) is (tightly) bounded below
by $2 + 4k + 4k^2$. So (t, k) must be a universal optimal stopping state whenever
$t \le 2 + 4k + 4k^2$. Now the lower bound $2 + 4k + 4k^2$ is achieved only in the
limit as a and b approach one-half. This suggests that the set of optimal
stopping states approaches its minimal possible size as a and b approach
one-half. Further encouragement for this suggestion is provided by the fact that
the lower bound ($2 + 4k + 8k^2$) of the right side of (14) is also achieved in the
limit as a and b approach one-half. Since the suggestion is correct, it should not
be surprising that the key to characterizing the universal optimal stopping states
is the examination of the limiting forms of R(t, k) and S(t, k) as a and b
approach one-half.
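
The two limiting lower bounds can be illustrated numerically. The following sketch evaluates the right sides of (18) and (14) for symmetric pairs $a = \frac12 + \varepsilon$, $b = \frac12 - \varepsilon$ and prints them next to $2 + 4k + 4k^2$ and $2 + 4k + 8k^2$. The function names are illustrative, and the formula for $\alpha$ is an assumed reading of (6), which is not shown in this section.

```python
import math

def alpha(a, b):
    # Assumed form of alpha from (6).
    return 0.5 * math.log(a * (1 - b) / (b * (1 - a)))

def rhs_18(k, a, b):
    """Right side of (18): 2 + 2 sinh(ka) sinh((k+1)a) / ((a-b) sinh a)."""
    al = alpha(a, b)
    return 2 + 2 * math.sinh(k * al) * math.sinh((k + 1) * al) / ((a - b) * math.sinh(al))

def rhs_14(k, a, b):
    """Right side of (14): the right side of (18) plus 2k tanh(ka) / (a-b)."""
    return rhs_18(k, a, b) + 2 * k * math.tanh(k * alpha(a, b)) / (a - b)

for eps in (0.2, 0.05, 0.001):
    a, b = 0.5 + eps, 0.5 - eps
    for k in (1, 2, 3):
        print(eps, k,
              round(rhs_18(k, a, b), 3), 2 + 4 * k + 4 * k * k,
              round(rhs_14(k, a, b), 3), 2 + 4 * k + 8 * k * k)
```

As $\varepsilon$ shrinks, both columns of computed values should settle onto the corresponding polynomial bounds from above.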

For the first of these,

$$
\lim_{a,b\to 1/2}(a-b)^{-1}R(t,k) = \lim_{a,b\to 1/2}(a-b)^{-1}\,t\tanh|k|\alpha = 2t|k|. \tag{21}
$$

It easily follows, by induction, from (9) and (21) that $(a-b)^{-1}S(t,k)$ converges
to a limit $S^*(t,k)$ as $a, b \to \tfrac12$ and that

$$
S^*(t,k) = \max\bigl\{2t|k|,\ \tfrac14 S^*(t-2,k-1) + \tfrac12 S^*(t-2,k) + \tfrac14 S^*(t-2,k+1)\bigr\}, \qquad t \ge 2, \tag{22}
$$

with initial values $S^*(t,k) = 2t|k|$ for t = 0, 1.


THEOREM 5. The pair (t, k) is an optimal stopping state for every a and b
(a > b) if and only if $S^*(t,k) = 2t|k|$. Moreover, there is an increasing sequence
of positive integers $T_0, T_1, T_2, \ldots$ for which $S^*(t,k) = 2t|k|$ if and only if
$t < T_{|k|}$. The first five values of the sequence are 3, 14, 41, 82, 136.
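
The sequence $T_0, T_1, \ldots$ can be computed directly from the recursion (22). The sketch below is one possible implementation, not the author's program: it uses exact rational arithmetic, takes $T_k$ to be the first t at which the continuation value in (22) strictly exceeds $2t|k|$, exploits the symmetry $S^*(t,-k) = S^*(t,k)$, and truncates the range of k far enough out that the truncation cannot influence the states of interest. The function name and its arguments are hypothetical.

```python
from fractions import Fraction

def universal_thresholds(k_max, t_max):
    """Return [T_0, ..., T_k_max] found from recursion (22) for t <= t_max (None if not reached)."""
    # Padding in k: an error introduced at k = K needs more than t_max steps in t to reach k_max.
    K = k_max + t_max // 2 + 2
    stop = lambda t, k: Fraction(2 * t * k)          # stopping value 2t|k| for k >= 0
    rows = {t: [stop(t, k) for k in range(K + 1)] for t in (0, 1)}
    T = [None] * (k_max + 1)
    for t in range(2, t_max + 1):
        prev = rows[t - 2]
        row = []
        for k in range(K + 1):
            if k == K:
                row.append(stop(t, k))               # truncation: treat as a stopping state
                continue
            below = prev[1] if k == 0 else prev[k - 1]   # S*(t, -1) = S*(t, 1)
            cont = Fraction(1, 4) * below + Fraction(1, 2) * prev[k] + Fraction(1, 4) * prev[k + 1]
            row.append(max(stop(t, k), cont))
            if k <= k_max and T[k] is None and cont > stop(t, k):
                T[k] = t
        rows[t] = row
    return T

print(universal_thresholds(1, 20))
```

Calling `universal_thresholds(4, 140)` gives values that can be compared with the five quoted in Theorem 5.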


PROOF. The second assertion is proved in the same way Theorem 1 is proved,
and the values of the sequence are found by direct numerical calculations using
(22). If (t, k) is an optimal stopping state for every a and b, then

$$
S^*(t,k) = \lim_{a,b\to 1/2}(a-b)^{-1}S(t,k) = \lim_{a,b\to 1/2}(a-b)^{-1}\,t\tanh|k|\alpha = 2t|k|.
$$


Now consider a specific pair a and b, a > b. To establish the converse, it will
be shown that

$$
\frac{4\beta\cosh k\alpha}{a-b}\,\bigl[S(t,k) - t\tanh|k|\alpha\bigr] \le S^*(t,k) - 2t|k| \tag{23}
$$

for all pairs (t, k). It will follow whenever $S^*(t,k) = 2t|k|$ that $S(t,k) =
t\tanh|k|\alpha$, so that (t, k) is an optimal stopping state for a and b. Denote the left
and right sides of (23) by Q(t, k) and Q*(t, k), respectively. It follows from (9)
and (22) that for $t \ge 2$,

$$
Q(t,k) = \begin{cases}
vQ(t-2,0) + 2\beta Q(t-2,1) + 4\beta(t-2), & k = 0,\\
\max\bigl(0, \tilde Q(t,k)\bigr), & k \ne 0,
\end{cases} \tag{24}
$$

and

$$
Q^*(t,k) = \begin{cases}
\tfrac12 Q^*(t-2,0) + \tfrac12 Q^*(t-2,1) + (t-2), & k = 0,\\
\max\bigl(0, \tilde Q^*(t,k)\bigr), & k \ne 0,
\end{cases} \tag{25}
$$

where

$$
\tilde Q(t,k) = \beta Q(t-2,k-1) + vQ(t-2,k) + \beta Q(t-2,k+1) - \frac{8\beta}{a-b}\sinh|k|\alpha \tag{26}
$$

and

$$
\tilde Q^*(t,k) = \tfrac14 Q^*(t-2,k-1) + \tfrac12 Q^*(t-2,k) + \tfrac14 Q^*(t-2,k+1) - 4|k|. \tag{27}
$$
The task is to show that $Q^*(t,k) \ge Q(t,k)$ for all (t, k). This is obvious from
(23) for t = 0, 1; both sides of (23) are zero for all k. The proof proceeds by
induction. Assume that $Q^*(t-2,k) \ge Q(t-2,k)$ for all k. Then

$$
Q^*(t,0) - Q(t,0) \ge (\tfrac14-\beta)\bigl(Q^*(t-2,1) + 4(t-2)\bigr) + (\tfrac12 - v)\,Q^*(t-2,0) + (\tfrac14-\beta)\,Q^*(t-2,1), \tag{28}
$$

and since $Q(t-2,k) \ge 0$ for all k and

$$
\frac{8\beta}{a-b}\sinh|k|\alpha \ge \frac{8\beta|k|}{a-b}\sinh\alpha = 4|k|,
$$

one obtains for $k \ne 0$,

$$
\tilde Q^*(t,k) - \tilde Q(t,k) \ge (\tfrac14-\beta)\,Q^*(t-2,k-1) + (\tfrac12 - v)\,Q^*(t-2,k) + (\tfrac14-\beta)\,Q^*(t-2,k+1). \tag{29}
$$

It will be shown that the right sides of (28) and (29) are nonnegative, so that
$Q^*(t,0) \ge Q(t,0)$, and $\tilde Q^*(t,k) \ge \tilde Q(t,k)$ for $k \ne 0$. From (24) and (25), it follows
that $Q^*(t,k) \ge Q(t,k)$ for all (t, k).

The right sides of (28) and (29) have the form $a_1x_1 + a_2x_2 + a_1x_3$ with
$(a_1, a_2) = (\tfrac14-\beta, \tfrac12 - v)$. Since, in general, $a_1x_1 + a_2x_2 + a_1x_3 = a_1(x_1 + x_3 -
2x_2) + (2a_1 + a_2)x_2$, this expression is nonnegative if

$$
a_1 \ge 0, \qquad 2a_1 + a_2 \ge 0, \qquad x_2 \ge 0, \qquad x_1 + x_3 - 2x_2 \ge 0.
$$

But $a_1 \ge 0$ since $\beta^2 = a(1-a)b(1-b) \le \tfrac1{16}$, and $2a_1 + a_2 \ge 0$ since $4\beta^2 =
4a(1-a)b(1-b) \le (a(1-b) + b(1-a))^2 = (1-v)^2$. Clearly, the values of
$x_2$ arising from the right sides of (28) and (29) are nonnegative. Finally, it
follows from Proposition 1 below that the values of $x_1 + x_3 - 2x_2$, arising from
the right sides of (28) and (29), are nonnegative. [It is enough to consider k > 0
since (19) is symmetric in k.] □


PROPOSITION 1. For each $t \ge 0$, the sequence $Q^*(t,1) + 4t$, $Q^*(t,0)$,
$Q^*(t,1)$, $Q^*(t,2), \ldots$ is convex in the sense that all of its second differences are
nonnegative.


PROOF. This is obvious for t = 0 and 1 since, for such t, $Q^*(t,k) = 0$ for
all k. Suppose the sequence is convex when t is replaced by t − 2, $t \ge 2$. The
task is to show the convexity for t. It is enough [see (25)] to show that the
sequence $\tilde Q^*(t,1) + 4t$, $Q^*(t,0)$, $\tilde Q^*(t,1)$, $\tilde Q^*(t,2), \ldots$ is convex. For convenience,
$Q^*(t-2,k)$ will be abbreviated to Q(k). Then

$$
\begin{aligned}
(\tilde Q^*(t,1) + 4t) + \tilde Q^*(t,1) - 2Q^*(t,0)
&= -\tfrac12 Q(0) + \tfrac12 Q(2) + 2(t-2) \\
&= \tfrac12\bigl([Q(1) + 4(t-2)] + Q(1) - 2Q(0)\bigr) + \tfrac12\bigl(Q(0) + Q(2) - 2Q(1)\bigr) \ge 0.
\end{aligned}
$$

Likewise,

$$
\begin{aligned}
Q^*(t,0) + \tilde Q^*(t,2) - 2\tilde Q^*(t,1)
&= -\tfrac14 Q(1) + \tfrac14 Q(3) + (t-2) \\
&= \tfrac14\bigl([Q(1) + 4(t-2)] + Q(1) - 2Q(0)\bigr) + \tfrac12\bigl(Q(0) + Q(2) - 2Q(1)\bigr) \\
&\qquad + \tfrac14\bigl(Q(1) + Q(3) - 2Q(2)\bigr) \ge 0.
\end{aligned}
$$

Finally, for $k \ge 2$,

$$
\tilde Q^*(t,k-1) + \tilde Q^*(t,k+1) - 2\tilde Q^*(t,k) = \tfrac14 Q(k-2) + \tfrac14 Q(k+2) - \tfrac12 Q(k) \ge 0. \qquad \square
$$


3. The Bayes stopping rule for symmetric priors. Let G be a symmetric
prior distribution for $(p_1, p_2)$ and let N denote the horizon. As noted earlier, the
Bayes stopping rule can be described in terms of Markovian states (n, r, s),
where n represents the (current) number of sampled pairs, and where r and s are
the (current) numbers of successes for the first and second treatments, respectively.

It seems unlikely that there is anything comparable to Theorem 1 which could
provide a simple description of the optimal continuation region for this more
general setting. For instance, (n, r + 1, s + 1) need not be an optimal continuation
state when (n, r, s) is. This cannot happen with two-point symmetric priors
because, for such priors, the relevant Markovian state is (t, k) = (N − 2n, r − s).
Moreover, (n, r, s) can be an optimal continuation state even though (n − 1, r, s)
is not for a larger horizon. Again, this cannot happen with two-point symmetric
priors. The full range of possibilities is unknown.

Nevertheless, there is one simple result which can be used to identify certain 
triplets (n, r, s) as optimal stopping states. 


THEOREM 6. The triplet (n, r, s) is an optimal stopping state for a symmetric
prior G if for each (a, b), a > b, in the support of G, (t, k) = (N − 2n, r − s)
is an optimal stopping state for the symmetric prior on the two points (a, b) and
(b, a). In particular, (n, r, s) must be an optimal stopping state if

$$
S^*(N-2n, r-s) = 2(N-2n)\,|r-s|, \tag{30}
$$

where S* is defined in (22).
PROOF. Let U(n, r, s) and V(n, r, s) denote the reward for stopping [implicitly
defined in (4)] and the optimal stopping reward, respectively, for the state
(n, r, s). Further, denote the functions R(t, k) and S(t, k) (described in Section
2) by $R_{a,b}(t,k)$ and $S_{a,b}(t,k)$, a > b, in order to reflect their dependence on a
and b. For b > a, let $R_{a,b}(t,k) = R_{b,a}(t,k)$ and $S_{a,b}(t,k) = S_{b,a}(t,k)$. It is
easily checked that

$$
U(n,r,s) = \int_0^1\!\!\int_0^1 |a-b|\,R_{a,b}(N-2n, r-s)\,G(da, db \mid n, r, s), \tag{31}
$$

where $G(\cdot \mid n, r, s)$ denotes the posterior distribution for $(p_1, p_2)$ in the state
(n, r, s). Finally, by a routine backward induction argument based on the size of
n, it can be shown that

$$
V(n,r,s) \le W(n,r,s), \tag{32}
$$

for all possible states (n, r, s), where

$$
W(n,r,s) = \int_0^1\!\!\int_0^1 |a-b|\,S_{a,b}(N-2n, r-s)\,G(da, db \mid n, r, s).
$$

Now consider the first statement of the theorem. By assumption, $S_{a,b}(t,k) =
R_{a,b}(t,k)$ for every pair (a, b) in the support of G. Thus W(n, r, s) = U(n, r, s),
and it follows from (32) that V(n, r, s) = U(n, r, s). Consequently, (n, r, s) is an
optimal stopping state. When (30) holds, it follows from Theorem 5 that
$S_{a,b}(t,k) = R_{a,b}(t,k)$ for every pair (a, b). □


Let us briefly consider the stopping rule suggested by Theorem 6, namely stop
as soon as one reaches a state (n, r, s) for which (30) holds. Stated more
explicitly: stop in state (n, r, s) if $2n > N - T_k$, where k = |r − s|. The beginning
of the sequence $T_0, T_1, \ldots$ is given in Theorem 5, and the terms of the
sequence can be computed using (22). Additional values are given in the introduction.
Since $(a-b)^{-1}S(t,k) \to S^*(t,k)$ as $a, b \to \tfrac12$ (see Section 2), it can be
shown that the rule is Bayes for appropriately chosen symmetric prior distributions,
depending on N, and that it is admissible.⁴ It can be expected to perform
well, from a Bayesian perspective, whenever the prior distribution is symmetric
and concentrated near $(p_1, p_2) = (\tfrac12, \tfrac12)$.
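
As an illustration, the rule "stop in state (n, r, s) if $2n > N - T_k$ with k = |r − s|" can be written as a small predicate. The sketch below hard-codes only the five values of $T_k$ quoted in Theorem 5, so it covers |r − s| ≤ 4; the function name and the choice to raise an error for larger differences are illustrative assumptions.

```python
# First five values T_0, ..., T_4 as quoted in Theorem 5.
T_FIRST = [3, 14, 41, 82, 136]

def envelope_rule_stops(n, r, s, N, T=T_FIRST):
    """True if the rule says to stop after n pairs with r and s successes, horizon N."""
    k = abs(r - s)
    if k >= len(T):
        # Further terms would have to be computed from recursion (22).
        raise ValueError("T_k is not tabulated for k = %d" % k)
    return 2 * n > N - T[k]

print(envelope_rule_stops(43, 30, 29, 100))   # 86 > 100 - 14 is false
print(envelope_rule_stops(44, 30, 29, 100))   # 88 > 100 - 14 is true
```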

One pleasant feature of this stopping rule, when $T_0 = 2$, is that it minimizes
for every pair $(p_1, p_2)$ the probability of rejecting the better treatment among all
Bayes symmetric stopping rules. While the probability of rejecting the better
treatment is not of direct concern under Anscombe’s (1963) model, Bather (1985)
has made a reasonable case for its consideration in the context of sequential
clinical trials. The proof that this probability is minimized depends upon two
facts. Firstly, among Bayes symmetric stopping rules, this “envelope” rule with
$T_0 = 2$ is the largest possible, i.e., the slowest to stop. This is a consequence of
Theorem 6. Secondly, for any symmetric prior distribution, the sequence of
posterior probabilities of rejecting the better treatment, for n = 0, 1, 2, ...,
$2n \le N$, is a supermartingale. Consequently, the obvious is true: the longer one
samples by pairs the smaller is the probability of rejecting the better treatment.


⁴ These assertions are still true if the value $T_0 = 2$ is used instead of the value $T_0 = 3$ given in
Theorem 5. The risk function is unaffected by the change.




It seems highly plausible that this “envelope rule” is, in fact, the largest
admissible symmetric stopping rule.⁵ If so, then the obvious rule based on N/2
pairs is excluded because of the loss structure; it could not be admissible.

Finally, some insight into the nature of this stopping rule can be obtained by
considering the random walk $S_0 = 0, S_1, S_2, \ldots$ whose step sizes −1, 0, 1 are
taken with probabilities $\tfrac14, \tfrac12, \tfrac14$, respectively. The same stopping rule with |r − s|
replaced by $|S_n|$ is optimal for the reward sequence $(N-2n)|S_n|$, i.e., it is
optimal to stop as soon as $2n > N - T_k$, where $k = |S_n|$. The connection is
apparent from the form of (22). One can use $T_0 = 2$ or $T_0 = 3$. (See footnote 4.)
The values of all of the other $T_k$’s may be unique; no other exceptions have been
found between k = 1 and k = 23. ($T_{23}$ = 3,773.)


4. Symmetrizable distributions. The rationale for this topic has already 
been indicated in the introduction. It is not always appropriate to use a 
symmetric prior. And yet it seems too much to expect to find a theory of much 
depth which includes all possible priors. Both of these issues are addressed by the 
consideration of symmetrizable distributions. The class of symmetrizable priors is 
probably sufficiently large to meet the needs of practitioners. And yet they are 
convenient to work with theoretically. Roughly speaking, whatever is true for 
symmetric priors is also true, in a suitably modified sense, for symmetrizable 
priors. Currently, the theory for symmetric two-point priors is quite a bit more 
satisfactory than is the theory for general symmetric priors. This distinction 
carries over to symmetrizable priors. 

The emphasis in this section is expository. While the concepts and results are 
stated precisely, no proofs are given. Most of the proofs are fairly straightforward 
and can easily be supplied by the reader. 

A distribution G on the open unit square is said to be symmetrizable with
associated parameter $\theta$, and one writes $G \in \mathcal{S}(\theta)$, if $G(p_1 \ne p_2) > 0$ and the
measure G′ defined by

$$
G'(dp_1, dp_2) = \begin{cases}
(p_1 q_2)^{-\theta}\,G(dp_1, dp_2), & \theta \ge 0,\\
(p_2 q_1)^{\theta}\,G(dp_1, dp_2), & \theta < 0,
\end{cases} \tag{33}
$$

is symmetric in $p_1$ and $p_2$, where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$. The restriction
to the open unit square is for convenience and seems harmless. Likewise, a prior
G for which $G(p_1 \ne p_2) = 0$ is of no interest here. The parameter $\theta$ can assume
any real value. When $\theta$ is integer-valued (and the horizon N is sufficiently large),
G can have a symmetric posterior.

In the sequel, it is convenient to think of the measure G′ as a prior. It may not
be a finite measure, in which case it is better thought of as an improper prior: it
cannot be normalized to make it a probability measure.


⁵ The envelope rule is the largest admissible symmetric stopping rule. As Larry Brown has
pointed out to the author, this easily follows from a remarkable lemma appearing in Gutmann (1982):
Any admissible symmetric stopping rule must be Bayes and, hence, no larger than the envelope
(stopping) rule.


The associated parameter $\theta$ is unique. Moreover,

$$
\operatorname{sign}\Bigl(\int\!\!\int (p_1 - p_2)\,G(dp_1, dp_2)\Bigr) = \operatorname{sign}(\theta). \tag{34}
$$

So when G is the prior distribution, the sign of $\theta$ indicates which treatment is
preferred. The first treatment is preferred when $\theta > 0$, and the second when
$\theta < 0$. When $\theta = 0$, neither treatment is preferred because G is symmetric.

Suppose $G \in \mathcal{S}(\theta)$ is a prior distribution. Then the posterior distribution $G_n$,
after the two treatments have been assigned to n pairs of patients, is symmetrizable
with the associated parameter

$$
\theta_n = \theta + r - s, \tag{35}
$$

where r and s are the numbers of successes with the first and second treatments,
respectively. Without having to compute any posterior expectations, one can
decide which treatment is currently preferred by simply examining the sign of $\theta_n$.

Now suppose G is a prior on two symmetric points (a, b) and (b, a). Then
$G \in \mathcal{S}(\theta)$, where

$$
\theta = (2\alpha)^{-1}\log\bigl(G((a,b))/G((b,a))\bigr), \tag{36}
$$

and where $\alpha$ is defined in (6). Under this prior, the problem of optimal stopping,
described in Section 2, depends upon the Markovian state (t, k) defined in (7).
There are analogues of Theorems 2, 3, and 4. The analogue of Theorem 2 states
that (t, k) is an optimal continuation state if

$$
t \ge 2 + \frac{2\sinh(k+\theta)\alpha\,\sinh(k+\theta+1)\alpha}{(a-b)\sinh\alpha} + \frac{2(k+\theta)\tanh(k+\theta)\alpha}{a-b}, \qquad k + \theta \ge 0. \tag{37}
$$

It seems likely that the stopping rule which continues as long as (37) holds will
perform very nearly as well as the optimal stopping rule.
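
For a two-point prior, the updating rule $\theta_n = \theta + r - s$ can be checked by a direct Bayes calculation. The sketch below reads (36) as $\theta = (2\alpha)^{-1}\log\{G((a,b))/G((b,a))\}$ and assumes $\alpha = \frac12\log\{a(1-b)/(b(1-a))\}$ for the constant in (6); both readings are assumptions, and the function names are illustrative.

```python
import math

def alpha(a, b):
    # Assumed form of alpha from (6).
    return 0.5 * math.log(a * (1 - b) / (b * (1 - a)))

def theta_two_point(g, a, b):
    """Associated parameter when the prior puts mass g on (a, b) and 1 - g on (b, a)."""
    return math.log(g / (1 - g)) / (2 * alpha(a, b))

def posterior_mass(g, a, b, n, r, s):
    """Posterior mass on (a, b) after n pairs with r and s successes on treatments 1 and 2."""
    like_ab = a ** r * (1 - a) ** (n - r) * b ** s * (1 - b) ** (n - s)
    like_ba = b ** r * (1 - b) ** (n - r) * a ** s * (1 - a) ** (n - s)
    return g * like_ab / (g * like_ab + (1 - g) * like_ba)

g, a, b, n, r, s = 0.7, 0.6, 0.5, 10, 7, 4
theta = theta_two_point(g, a, b)
theta_n = theta_two_point(posterior_mass(g, a, b, n, r, s), a, b)
print(theta, theta_n, theta_n - theta)   # the difference should equal r - s
```

The difference between the posterior and prior parameters depends on the data only through r − s, which is why the sign of $\theta_n$ can be read off without computing posterior expectations.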

The picture is less complete for a general symmetrizable prior G. When $\theta \ne 0$,
G is not symmetric and it is convenient to work, instead, with the symmetric
“prior” G′ defined in (33). For the sake of definiteness, assume $\theta > 0$. The
horizon N and the Markovian state (n, r, s) need to be replaced by $N' = N + 2\theta$
and $(n', r', s') = (n + \theta, r + \theta, s)$, respectively. These are easiest to interpret
when $\theta$ is an integer. In any event, the new “number of patients remaining,”
$N' - 2n'$, in state $(n', r', s')$ is an integer, and it is equal to the old number of
patients remaining, N − 2n, in the state (n, r, s). So even if $\theta$ is not an integer,
there is no inherent problem in carrying out the required backward induction to
decide which points $(n', r', s')$ are optimal continuation points under G′, and
which are optimal stopping points. Notice that

$$
p_1^{r'} p_2^{s'} q_1^{n'-r'} q_2^{n'-s'}\,G'(dp_1, dp_2) = p_1^{r} p_2^{s} q_1^{n-r} q_2^{n-s}\,G(dp_1, dp_2). \tag{38}
$$

So even if G′ is viewed as an improper prior, there is a proper posterior in state
$(n', r', s')$. And as (38) shows, this posterior agrees with that for G in the state
(n, r, s). It follows that (n, r, s) is an optimal continuation point under G
whenever $(n', r', s')$ is an optimal continuation point under G′. And the same
relationship applies to optimal stopping points.

It is presently impossible to obtain many of the benefits promised by the
“machinery” just described because the current theory for general symmetric
priors is far from adequate. Nevertheless, Theorem 6 can be exploited when $\theta$ is
an integer: The point (n, r, s) is an optimal stopping state for the symmetrizable
prior $G \in \mathcal{S}(\theta)$ if $2n > N - T_k$, where $k = |r + \theta - s|$. (The meaning of $T_k$ has
been discussed in previous sections.)

A second benefit suggests itself for those who already have a stopping rule
which they prefer to use whenever there is little or no reason to believe that one
treatment is better than the other. [Several candidates for this kind of rule have
been suggested by Vogel (1960a, b), by Anscombe (1963), by Lai, Levin, Robbins
and Siegmund (1980), and by Bather and Simons (1985).] Such rules can be
modified to handle situations for which there is an initial preference; stop in state
(n, r, s) if the preferred rule says to stop in state $(n', r', s')$.


5. Ethical costs. Chernoff and Petkau (1985) have recently shown how
“ethical costs” can be incorporated into Anscombe’s (1963) model when the
treatment responses are normally distributed. The same thing can be done when
the treatment responses are “successes” and “failures.” The idea is quite simple:
For any prior G, after n pairs of patients have been treated, with the results
observed, one expects the two treatments to yield successes in the future with
probabilities $E_n p_1$ and $E_n p_2$, where “$E_n$” denotes conditional expectation given
the results from the n pairs of patients. Thus $|E_n(p_1 - p_2)|$ represents a
reasonable estimate of the “expected successes lost” (a fraction of one) should
a future patient be assigned to the apparently inferior treatment. According
to Chernoff and Petkau’s reckoning, the physician incurs an ethical cost
$\gamma|E_n(p_1 - p_2)|$ if the inferior appearing treatment is actually assigned to a future
patient, where the proportionality constant $\gamma > 0$ is a known parameter.

The mathematical effect of this innovation is fairly slight. Instead of the
reward sequence described in (4), one must use

$$
R_n = (N - 2n)\,|E_n(p_1 - p_2)| - 2\gamma\sum_{m=0}^{n-1}|E_m(p_1 - p_2)|. \tag{39}
$$

In general, this reward is no longer a function of a Markovian state (n, r, s),
where r and s are the numbers of successes registered by the first and second
treatments. Nevertheless, the optimal stopping problem is still Markovian,
i.e., dependent on (n, r, s). Intuitively, this is because the ethical cost
$\gamma\sum_{m=0}^{n-1}|E_m(p_1 - p_2)|$, entering into (39), is the result of past decisions; it is
nonrecoverable. Consequently, the difference between the optimal stopping reward
and the reward for stopping is a function of the state (n, r, s). When this is
strictly positive, it is optimal to continue; when this is zero, it is optimal to stop.

The previous results in this paper can be extended to this setting without
much difficulty. For instance, for Theorem 2, the simpler Markovian state (t, k)
is still appropriate, and it turns out that (t, k), $k \ge 0$, is an optimal continuation
point whenever

$$
t \ge 2 + \frac{2(1+\gamma)\sinh k\alpha\,\sinh(k+1)\alpha}{(a-b)\sinh\alpha} + \frac{2k\tanh k\alpha}{a-b}.
$$

The asymptotic growth rate of the transition points $\tau_k$, referred to in Theorem 1,
is increased by the factor $1 + \gamma$. The boundary still grows with t at the rate
$(2\alpha)^{-1}\log t$, independent of $\gamma$ (as a first-order approximation).

Acknowledgments. The author is grateful to John Bather for numerous 
conversations, during the author’s sabbatical at the University of Sussex, which 
have contributed to the development of the ideas in this paper. Readers who are 
familiar with his techniques for approximating stopping boundaries, from within 
and from without, will recognize the influence his ideas have had on Theorems 2, 
3 and 4. See Bather (1983). In addition, the author is indebted to John Bather 
and an unknown referee for very helpful comments on the original version of this 
paper. 


REFERENCES 


ANSCOMBE, F. J. (1963). Sequential medical trials. J. Amer. Statist. Assoc. 58 365-383.

BATHER, J. A. (1983). Optimal stopping of Brownian motion: A comparison technique. In Recent
Advances in Statistics (M. H. Rizvi, J. S. Rustagi, and D. Siegmund, eds.) 19-49.
Academic, New York.

BATHER, J. A. (1985). On the allocation of treatments in sequential medical trials. Internat. Statist.
Rev. 53 1-14.

BATHER, J. A. and SIMONS, G. D. (1985). The minimax risk for two-stage procedures in clinical trials.
J. Roy. Statist. Soc. Ser. B 47 466-475.

CHERNOFF, H. (1972). Sequential Analysis and Optimal Design. SIAM, Philadelphia.

CHERNOFF, H. and PETKAU, A. J. (1981). Sequential medical trials involving paired data. Biometrika
68 119-132.

CHERNOFF, H. and PETKAU, A. J. (1985). Sequential medical trials with ethical costs. Proc. Berkeley
Conf. in Honor of Jerzy Neyman and Jack Kiefer (L. Le Cam and R. Olshen, eds.) 1.
Wadsworth, Monterey, Calif.

GUTMANN, S. (1982). Stein’s paradox is impossible in problems with finite sample space. Ann.
Statist. 10 1017-1020.

LAI, T. L., LEVIN, B., ROBBINS, H. and SIEGMUND, D. (1980). Sequential medical trials. Proc. Nat.
Acad. Sci. U.S.A. 77 3135-3138.

SIMONS, G. (1986). A comparison of seven allocation rules for a clinical trial model. Sequential Anal.
To appear.

VOGEL, W. (1960a). A sequential design for the two-armed bandit. Ann. Math. Statist. 31 430-443.

VOGEL, W. (1960b). An asymptotic minimax theorem for the two-armed bandit problem. Ann. Math.
Statist. 31 444-451.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF NORTH CAROLINA
CHAPEL HILL, NORTH CAROLINA 27514 


The Annals of Statistics 
1986, Vol. 14, No. 3, 971-993 


AN EXTREME VALUE THEORY FOR SEQUENCE MATCHING 


By RICHARD ARRATIA,¹,² LOUIS GORDON AND MICHAEL WATERMAN²
University of Southern California 


Consider finite sequences $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_n$ where the
letters $\{X_i\}$ and $\{Y_j\}$ are chosen i.i.d. on a countable alphabet with $p =
P\{X_1 = Y_1\} \in (0,1)$. We study the distribution of the longest contiguous run
of matches between the X’s and Y’s, allowing at most k mismatches. The
distribution is closely approximated by that of the maximum of (1 − p)mn
i.i.d. negative binomial random variables. The latter distribution is in turn
shown to behave like the integer part of an extreme value distribution. The
expectation is approximately

$$
\log(qmn) + k\log\log(qmn) + k\log(q/p) - \log(k!) + \gamma\log(e) - \tfrac12,
$$

where q = 1 − p, log denotes logarithm base 1/p, and $\gamma$ is the Euler
constant. The variance is approximated by $(\pi\log(e))^2/6 + \tfrac1{12}$. The paper
concludes with an example in which we compare segments taken from the
DNA sequence of the bacteriophage lambda.


0. Introduction. DNA sequences can be represented as finite sequences over
the four-letter alphabet {A, C, G, T}. Such a sequence corresponds to successive
appearances of the nucleotides adenine (A), cytosine (C), guanine (G), and
thymine (T). One of the impressive accomplishments of molecular biology is the
facility with which the sequences corresponding to actual genetic material are
determined. Much effort is currently invested in determining the DNA sequences
belonging to the chromosomes of various organisms. See for example the book
Nucleotide Sequences 1984 [Anderson et al. (1984)], which is an atlas of such
representations. By mid-1985, DNA sequences with a total length of approximately
$5 \times 10^6$ were known, and sequencing was proceeding at an approximate
rate of $10^6$ letters per year.

Sequences belonging to seemingly unrelated organisms have been found to 
possess long contiguous subsequences which are practically identical. Doolittle et 
al. (1983) report an unexpected relationship of this kind between viral DNA and 
host DNA. The identification and interpretation of such shared contiguous 
subsequences are of substantial interest to biologists. See Waterman (1984) for a 
review of these methods. 

These aspects of matching between sequences lead us to ask the following
mathematical question: for two independently generated random sequences, what
is the distribution of the length of the longest run of contiguous matches?
Evolution of nucleotide sequences proceeds by substitution, insertion, and deletion
of nucleotides. Substitutions motivate us to study the distribution of the
length of the longest contiguous run of matches allowing for a fixed number k of
mismatches. We do not obtain corresponding results for the more difficult case of
insertions and deletions.


Received October 1984; revised October 1985.
¹ Supported by NSF grant DMS-8402590.
² Supported by a grant from the System Development Foundation.
AMS 1980 subject classifications. Primary 62E20; secondary 62P10.
Key words and phrases. Extreme value, matching, Poisson process, inclusion-exclusion, DNA
sequences.

In Smith et al. (1985), a data analysis shows real, unrelated DNA sequences to 
have the same distribution of matching scores as independent sequences of the 
same composition. This supports the claim that the distributional results of this 
paper have genuine biological interest. In Section 7, the sequence of the virus 
lambda is shown to fit the distribution very well. 

The question of the longest run of consecutive matches is closely related to the
study of runs in a single sequence. Erdős and Révész (1975) consider the length of
the longest run of heads in a sequence of heads and tails generated by fair coin
tossing. Using a combinatorial approach, they provide almost sure upper and
lower boundaries for the longest head run, and for runs interrupted by at most k
tails.

Guibas and Odlyzko (1980) use generating function methods to provide deep
results on problems related to those of Erdős and Révész. They look at the
longest run of repetitions of a specified pattern. Among many other results, they
compute the expectation and variance of the length of the longest run of
repetitions, and they make the intriguing remark that the length of the longest
run of repetitions has no limiting distribution.

Several results from Guibas and Odlyzko are generalized to the case of biased
coin tossing, and runs with at most k interruptions, in Gordon, Schilling, and
Waterman (1986). In that paper, a sequence of coin tosses is represented in terms
of independent geometric random variables, and then analyzed using
inclusion-exclusion and counting methods in a way similar to Watson (1954).

This paper gives results for the approximate large sample distribution, mean 
and variance for the length of the longest run of consecutive matches, allowing a 
fixed number k = 0,1,2... of mismatches, between two sequences with all letters 
independently drawn from some given distribution. The results in their most 
useful form are collected and proved in Theorems 1 and 2 of Section 4. Karlin et 
al. (1983) report limiting variances and expectations for pure runs of matches of a 
sequence with itself, similar to those we obtain in Theorem 2. Strong laws of large 
numbers for the length of the longest match between two or more sequences were 
given in Arratia and Waterman (1985a) and (1985b). 

Define the length of the longest match starting within the first m letters of the 
sequence $X_1, X_2, \ldots$ and the first n letters of $Y_1, Y_2, \ldots$ to be 

$M(m, n) = \max\{u : X_{i+1} \cdots X_{i+u} = Y_{j+1} \cdots Y_{j+u} \text{ for some } 0 \le i < m,\ 0 \le j < n\}.$
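
The definition of M(m, n) can be evaluated directly on two finite strings. The following is a minimal sketch, not taken from the paper: the function name `longest_match` and the use of 0-based start offsets (offset i means the match begins at letter $X_{i+1}$) are our own conventions, and a match is simply cut off where either finite string ends.

```python
def longest_match(x, y, m, n):
    """Return M(m, n): the largest u with x[i:i+u] == y[j:j+u]
    for some start offsets 0 <= i < m and 0 <= j < n."""
    lx, ly = len(x), len(y)
    # run[i][j] plays the role of A(i, j): the number of consecutive
    # matches starting at x[i], y[j]. The extra zero row and column
    # mark the ends of the finite strings (an assumption of this sketch).
    run = [[0] * (ly + 1) for _ in range(lx + 1)]
    best = 0
    for i in range(lx - 1, -1, -1):
        for j in range(ly - 1, -1, -1):
            if x[i] == y[j]:
                run[i][j] = run[i + 1][j + 1] + 1
            # Only starting offsets inside the m-by-n window count,
            # although the match itself may extend beyond the window.
            if i < m and j < n:
                best = max(best, run[i][j])
    return best
```

The table is filled from the ends of the strings backwards, so the cost is proportional to the product of the two string lengths.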

Throughout this paper we assume that the letters $\{X_i\}$ and $\{Y_j\}$ are all chosen 
independently from a countable alphabet S according to a nontrivial distribution 
$\mu$. Let p represent the probability that two different letters match, so that 

(0.1) $p = \sum_{a \in S} (\mu_a)^2 \in (0, 1).$

The definition of M(m, n) directly suggests an analysis according to position. 
Let A(i, j) be the length of the match which begins after position (i, j), 


SEQUENCE MATCHING 973 
so that $A(i, j) \ge u$ iff $X_{i+1} \cdots X_{i+u} = Y_{j+1} \cdots Y_{j+u}$. Thus 
$M(m, n) = \max_{0 \le i < m,\ 0 \le j < n} A(i, j)$, expressing the length of the longest match as the 
maximum of mn random variables A(i, j), where each A(i, j) is geometrically 
distributed with parameter p. These random variables A(i, j) are dependent; it 
should not be intuitively obvious that their maximum is comparable in distribution 
to the maximum of mn independent copies of A(i, j). 

Note that the distribution of the family {A(i, j)} is strictly stationary with 
respect to translations in each of the two parameters. A general theory of maxima 
for stationary sequences is well developed, but does not seem directly applicable 
here; see for example the book by Leadbetter, Lindgren and Rootzén (1983). 
Another situation involving the maximum of a multiple-parameter stationary 
family occurs in Darling and Waterman (1985). 

Our main result, which is stated in Theorem 2, can be loosely described as: if 
m and $n \to \infty$ with $(\log m)/(\log n) \to 1$, then M(m, n) has approximately the 
same distribution as the maximum of (1 - p)mn independent geometric(p) 
random variables. Thus $M(m, n) - \log_{1/p}((1 - p)mn)$ has approximately an 
integerized extreme value distribution: $P(M(m, n) - \log_{1/p}((1 - p)mn) < c)$ is 
uniformly approximated by $\exp(-p^c)$, for c such that $c + \log_{1/p}((1 - p)mn)$ is 
an integer. Theorem 2 also states results for the length of the longest match run 
with at most k mismatches, which behaves like the maximum of (1 - p)mn 
independent negative binomial random variables. 
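
The form $\exp(-p^c)$ can be traced in a few lines. The following is a heuristic sketch only, under the idealized assumption of exactly $N = (1 - p)mn$ independent geometric(p) variables $G_1, \ldots, G_N$ with $P(G_i \ge u) = p^u$; the symbols N and $G_i$ are introduced here for illustration.

```latex
\begin{align*}
P\Bigl(\max_{i \le N} G_i < u\Bigr) &= (1 - p^u)^N \approx \exp(-N p^u), \\
u = c + \log_{1/p} N \quad &\Longrightarrow \quad N p^u = p^c, \\
P\Bigl(\max_{i \le N} G_i - \log_{1/p} N < c\Bigr) &\approx \exp(-p^c).
\end{align*}
```

The restriction to values of c for which $c + \log_{1/p} N$ is an integer reflects the fact that the maximum takes integer values only.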


1. Survey of the analysis. The mutual dependence of the random variables 
A(i, j) shows up in two qualitatively different ways. First of all, for each (i, j), 
the events $\{A(i, j) \ge u\}$ and $\{A(i+1, j+1) \ge u\}$ are positively correlated. 
Given the event $\{A(i, j) \ge u\}$ that we observe a match of length at least 
u after (i, j), the expected number of consecutive events $\{A(i, j) \ge u\}$, 
$\{A(i+1, j+1) \ge u\}$, $\{A(i+2, j+2) \ge u\}, \ldots$ associated with that match is 
$1 + p + p^2 + \cdots = (1 - p)^{-1}$. The same phenomenon, clusters of average size 
$(1 - p)^{-1}$, occurs in the analysis of the longest head run in a sequence of tosses of 
a p-biased coin. 

We need, therefore, to compensate for this clustering when approximating the 
distribution of M(m, n). This is achieved by introducing the random variables 
$A'(i, j) = A(i, j)\,1(X_i \ne Y_j)$, where 1(E) is the indicator function for the event 
E. In contrast to the positive correlation of analogous events defined in terms of 
the random variables A(i, j), the events $\{A'(i, j) \ge u\}$, $\{A'(i+1, j+1) \ge u\}$, 
$\{A'(i+2, j+2) \ge u\}, \ldots$ are negatively correlated. The distribution of each 
A'(i, j) is a mixture, with weights (1 - p) and p, respectively, of the geometric(p) 
distribution and the unit mass on zero. If there were no other dependence, then 
M(m, n) would be like the maximum of mn independent copies of A'(i, j), which 
in turn is like the maximum of (1 - p)mn independent copies of A(i, j). 

The second type of dependence occurs whenever the distribution $\mu$ of the 
letters is not the uniform distribution, $\mu_a = 1/|S|$ for all $a \in S$. The simplest 
aspect of this dependence is that a match at (i, s) is positively correlated with a 
match at (i, t) for all $s \ne t$: $P(X_i = Y_s \text{ and } X_i = Y_t) = P(X_i = Y_s = Y_t) = 
\sum_{a \in S} (\mu_a)^3 \ge (\sum_{a \in S} (\mu_a)^2)^2 = P(X_i = Y_s) P(X_i = Y_t)$, with equality if and only if 


974 R. ARRATIA, L. GORDON AND M. WATERMAN 


$\mu$ is uniform. It requires calculation, which we carry out in Sections 2-4, to prove 
that if m and $n \to \infty$ with $(\log m)/(\log n) \to 1$, then this second type of 
dependence does not have a significant effect on M(m, n). In contrast, it is shown 
in Arratia and Waterman (1985b) that there exists a critical constant $\theta_\mu \in [0, 1)$ 
depending on $\mu$, with $\theta_\mu = 0$ iff $\mu$ is uniform, such that if m and $n \to \infty$ with 
$\log(m)/\log(n) \to \theta \in (0, \infty)$, then 

(1.1) $M(m, n) / \log_{1/p}(mn) \xrightarrow{P} 1$ iff $\theta \in [\theta_\mu, 1/\theta_\mu]$.

See Section 5 for further discussion of the situation in which $\log(n)/\log(m) \to 
\theta \ne 1$. 

We now describe, for general k, the framework that will be used to prove our 
theorems about the length of the longest k-interrupted match run between two 
sequences. The discussion above has been about the special case, k = 0. Fix an 
integer u to serve as a test level. 

We say that a k-interrupted run of length u is witnessed at (i, j) if $X_i \ne Y_j$ 
and $X_{i+s} = Y_{j+s}$ for all but at most k values of s, $1 \le s \le u$. We write 
$R_k(i, j; u)$ for the indicator of this event, which involves 2(u + 1) randomly 
chosen letters. When k = 0, we also speak of a (pure) run of u matches, witnessed 
at (i, j). We write $N_k(m, n; u)$ for the total number of witnesses at level u, and 
$M_k(m, n)$ for the length of the longest k-interrupted match, starting within 
the first m letters of one sequence and the first n letters of the other. Formally, 
we define, for integers $0 \le k < u$, the random variables 

(1.2) $R_k(i, j; u) = 1(X_i \ne Y_j)\, 1\bigl(k \ge \sum_{1 \le s \le u} 1(X_{i+s} \ne Y_{j+s})\bigr),$

$N_k = N_k(m, n; u) = \sum_{0 \le i < m,\ 0 \le j < n} R_k(i, j; u),$

and 

$M_k = M_k(m, n) = \max\{u : N_k(m, n; u) > 0\}.$
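
The three quantities in (1.2) can be computed by brute force on short strings. This is an illustrative sketch, not the authors' procedure: the function names are hypothetical, indices are 0-based so that `x[i]` stands for $X_i$, and we assume that a bundle running past the end of either finite string is not counted and that $M_k$ is reported as 0 when no witness exists.

```python
def witness(x, y, i, j, u, k):
    """R_k(i, j; u): 1 if x[i] != y[j] and at most k of the u
    comparisons x[i+s] vs y[j+s], 1 <= s <= u, are mismatches."""
    if i + u >= len(x) or j + u >= len(y):
        return 0  # assumption: bundles leaving the finite strings are ignored
    if x[i] == y[j]:
        return 0  # a witnessed run must start with a mismatch
    mismatches = sum(x[i + s] != y[j + s] for s in range(1, u + 1))
    return int(mismatches <= k)


def count_witnesses(x, y, m, n, u, k):
    """N_k(m, n; u): number of witnesses with 0 <= i < m, 0 <= j < n."""
    return sum(witness(x, y, i, j, u, k) for i in range(m) for j in range(n))


def longest_k_run(x, y, m, n, k):
    """M_k(m, n): the largest u > k with N_k(m, n; u) > 0, or 0 if none."""
    best = 0
    for u in range(k + 1, min(len(x), len(y))):
        if count_witnesses(x, y, m, n, u, k) > 0:
            best = u
    return best
```

For example, with `x = "CAAAT"` and `y = "GAAAC"` the bundle at (0, 0) witnesses a pure run of length 3, and a 1-interrupted run of length 4.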


In the special case of pure matching, k = 0, we omit the subscript k. Note that 
the event $\{M_k \ge u\}$ involves m + u X's and n + u Y's. We show in Section 6 
that the natural alternate definition involving exactly m X’s and n Y’s is 
negligibly different from the above definition. 

Bonferroni's (1936) inclusion-exclusion formulae, as used in Watson (1954), are 
the basis of our calculation that $P(N_k > 0) \approx 1 - \exp(-EN_k)$. Write $B = 
B(m, n; u) = \{b = (i, j; u) : 0 \le i < m,\ 0 \le j < n\}$ for the set of indices associated 
with potential witnesses to $\{M_k(m, n) \ge u\}$. For finite $C \subset B$, write 
$R_k(C) = \prod_{b \in C} R_k(b)$ for the indicator that for each $b = (i, j; u) \in C$, $X_i \ne Y_j$ 
and $X_{i+1} \cdots X_{i+u}$ matches $Y_{j+1} \cdots Y_{j+u}$ with at most k mismatches. The rth 
term in the inclusion-exclusion series for $P(N_k > 0)$ is the expectation of the 
random variable 

(1.3) $S_k^{(r)} = S_k^{(r)}(m, n; u) = \sum_{C \subset B,\ |C| = r} R_k(C).$


Truncation of the inclusion-exclusion series gives lower and upper bounds on the 
random variable $1(N_k > 0)$: for $\tau = 0, 1, 2, \ldots$ 

(1.4) $\sum_{1 \le r \le 2\tau} (-1)^{r+1} S_k^{(r)} \le 1(N_k > 0) \le \sum_{1 \le r \le 2\tau + 1} (-1)^{r+1} S_k^{(r)}.$

Furthermore, for each j = 0, 1, 2, ..., inclusion-exclusion (using $S_k^{(0)} = 1$) gives 
lower and upper bounds on the random variable $1(N_k = j)$: for $\tau = 0, 1, 2, \ldots$ 

(1.5) $\sum_{0 \le r \le 2\tau - 1} (-1)^r \binom{j + r}{r} S_k^{(j+r)} \le 1(N_k = j) \le \sum_{0 \le r \le 2\tau} (-1)^r \binom{j + r}{r} S_k^{(j+r)}.$
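
The truncation bounds (1.4) and (1.5) can be checked numerically for a realized count. The sketch below relies on the observation that, for a realization with N witnesses, the sum over r-subsets in (1.3) simply counts the r-subsets of the witnessing bundles, so that $S^{(r)}$ equals the binomial coefficient of N and r; the function names are ours.

```python
from math import comb


def bounds_on_positive(N, tau):
    """Lower and upper bounds of (1.4) on the indicator 1(N > 0),
    with S^(r) replaced by comb(N, r) for a realized count N."""
    lower = sum((-1) ** (r + 1) * comb(N, r) for r in range(1, 2 * tau + 1))
    upper = sum((-1) ** (r + 1) * comb(N, r) for r in range(1, 2 * tau + 2))
    return lower, upper


def bounds_on_exact(N, j, tau):
    """Lower and upper bounds of (1.5) on the indicator 1(N == j)."""
    def term(r):
        return (-1) ** r * comb(j + r, r) * comb(N, j + r)

    lower = sum(term(r) for r in range(0, 2 * tau))      # 0 <= r <= 2*tau - 1
    upper = sum(term(r) for r in range(0, 2 * tau + 1))  # 0 <= r <= 2*tau
    return lower, upper
```

Once the truncation point exceeds N the two bounds coincide with the indicator itself, which is why taking $\tau$ large recovers the exact probabilities after taking expectations.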


These truncated inclusion-exclusion bounds allow us to prove, in Theorems 1 
and 1', that the number $N_k$ of locations at which k-interrupted matches of length 
at least u begin is approximately Poisson, and that the locations at which these 
matches occur are approximately a Poisson process. To prove Theorem 1, we need 
to evaluate $ES_k^{(r)}$, for all positive integers r. For r = 1, $S_k^{(1)} = N_k$. For fixed 
r = 2, 3, ..., and any fixed t, Lemma 5 implies that as m and $n \to \infty$ with 
$\log(m)/\log(n) \to 1$, uniformly over integers u such that $EN_k < t$, 

(1.6) $ES_k^{(r)} - (EN_k)^r / r! \to 0.$

Taking expectations in (1.5) and using (1.6), we obtain the result $P(N_k = j) - 
\exp(-EN_k)(EN_k)^j / j! \to 0$, uniformly in integers u such that $EN_k < t$. Thus the 
Poisson distribution approximates the distribution of the number of witnesses to 
a match. 

For arbitrary m and n, there is the trivial upper bound, $P(M_k(m, n) \ge u) = 
P(N_k > 0) \le EN_k$, which should be used when $EN_k$ is small. Theorem 1 shows 
that, for m and n large with $\log(m)/\log(n)$ near 1, 

$P(M_k(m, n) \ge u) = P(N_k > 0) \approx 1 - \exp(-EN_k) = EN_k (1 - EN_k/2 + (EN_k)^2/6 - \cdots),$

so that the relative error in using the trivial upper bound is also small when the 
bound is small. 

At the level of establishing a limit distribution for the number of witnesses, 
N = N(m, n, u), the inclusion-exclusion framework described above is equivalent 
to the method of moments. To see this, note that for each d, the two sets of 
random variables, $\{S^{(1)}, S^{(2)}, \ldots, S^{(d)}\}$ and $\{N, N^2, \ldots, N^d\}$, both have the same 
d-dimensional linear span. Thus, calculating finite limits for $ES^{(1)}, ES^{(2)}, \ldots$ is 
equivalent to calculating finite limits for the moments of N. However, 
the inclusion-exclusion framework directly gives lower and upper bounds on 
P(N = 0), and thus could be used to get rate-of-convergence estimates for the 
distribution of $M(m, n) - \log_{1/p}((1 - p)mn)$. 



2. Combinatorial aspects of comparisons. In this section we discuss cer- 
tain combinatorial aspects of matching with shifts. There are two levels of graphs 
to consider. At the lower level, comparison graphs describe which comparisons 
are made between individual sites and their associated letters. At the higher level, 
adjacency graphs describe the overlap among bundles of comparisons. 

Recall from (1.2) that a run of matches must start with a mismatch. Therefore, 
it is notationally convenient to think of the sequences as beginning with one 
additional letter, which we index by 0. We shall see in Section 6 that the 
mathematically convenient definition of maximum run of matches is distribution- 
ally equivalent to the more intuitive definition of longest run of matches. 

In the rest of the paper, we consider two sequences of characters, $X_0, X_1, 
X_2, \ldots$ and $Y_0, Y_1, Y_2, \ldots$, located in the Euclidean plane at the lattice sites 
$Z^+ \times \{1, 2\}$. The process of comparing $X_i$ with $Y_j$ is represented as an undirected 
edge between (i, 1) and (j, 2). We usually call this the edge or comparison (i, j), 
instead of using the more correct but awkward notation {(i, 1), (j, 2)}. 

A set of u + 1 consecutive parallel comparisons is called a bundle, denoted by 
the triple (i, j; u) = {(i + k, j + k): k = 0, 1, ..., u}. The definition of bundle 
anticipates the study of the random variables $R_k(i, j; u)$ introduced in (1.2). A 
set C of r bundles, all sharing the same value of u, is called a configuration or an 
r-configuration. Two bundles, $b_1 = (i_1, j_1; u)$ and $b_2 = (i_2, j_2; u)$, are said to be 
x-adjacent, denoted $b_1 \sim_x b_2$, if $b_1 \ne b_2$ and $|i_1 - i_2| \le u$. They are said to be 
y-adjacent, denoted $b_1 \sim_y b_2$, if $b_1 \ne b_2$ and $|j_1 - j_2| \le u$. The bundles are said to 
be adjacent, denoted $b_1 \sim b_2$, if either $b_1 \sim_x b_2$, or $b_1 \sim_y b_2$, or both. Note that, by 
definition, a bundle is not adjacent to itself. The two bundles, $b_1$ and $b_2$, are said 
to be parallel if $j_1 - i_1 = j_2 - i_2$. They are said to be doubly adjacent if both 
$b_1 \sim_x b_2$ and $b_1 \sim_y b_2$. If two bundles are both adjacent and parallel, then they are 
doubly adjacent. 
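
The adjacency relations are simple predicates on the starting sites of two bundles. The sketch below is illustrative; the `Bundle` type and function names are our own, and the common length parameter u is passed explicitly rather than stored in the bundle.

```python
from typing import NamedTuple


class Bundle(NamedTuple):
    i: int  # x-site of the first comparison
    j: int  # y-site of the first comparison


def x_adjacent(b1, b2, u):
    return b1 != b2 and abs(b1.i - b2.i) <= u


def y_adjacent(b1, b2, u):
    return b1 != b2 and abs(b1.j - b2.j) <= u


def adjacent(b1, b2, u):
    return x_adjacent(b1, b2, u) or y_adjacent(b1, b2, u)


def parallel(b1, b2):
    # Parallel bundles lie on the same diagonal j - i.
    return b1.j - b1.i == b2.j - b2.i


def doubly_adjacent(b1, b2, u):
    return x_adjacent(b1, b2, u) and y_adjacent(b1, b2, u)
```

For parallel bundles the two offsets $i_1 - i_2$ and $j_1 - j_2$ are equal, so x-adjacency and y-adjacency hold or fail together; this is the reason adjacent parallel bundles are doubly adjacent.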

The reader may wish to construct the following figure. The comparison of $X_i$ 
with $Y_j$ corresponds to the lattice point (i, j) in $Z^2$. A bundle is then a set of 
u + 1 points lying on a 45-degree line. Parallel bundles overlap on these 45-degree 
lines. Adjacent bundles have overlapping x or y coordinates. Biologists perform 
visual analysis of matching by using such representations in what they call “dot 
matrix analysis.” See, for example, Waterman (1984). 

Associated with an r-configuration C are two graphs. The first is the comparison 
graph, denoted U(C), because it is the union of the bundles in C. The 
comparison graph U(C) is a bipartite graph on $Z^+ \times \{1, 2\}$ having at most 
r(u + 1) edges (comparisons). Each comparison connects an x-site in $Z^+ \times \{1\}$ 
with a y-site in $Z^+ \times \{2\}$. 

The second graph is the adjacency graph, denoted A(C), which is the 
restriction to C of the adjacency relation. Thus, the vertices of A(C) are bundles, 
and the edge $\{b_1, b_2\}$ is in A(C) if and only if $b_1 \in C$, $b_2 \in C$, and $b_1 \sim b_2$. The 
inclusion of edge $\{b_1, b_2\}$ in A(C) indicates that some of the comparisons which 
constitute the bundles $b_1$ and $b_2$ share common sites. 

The comparison graph U(C) has a particularly simple structure if the adjacency 
graph A(C) is atomic. This case corresponds to a set of r mutually 
independent runs of matches, because none of the bundles in C overlap. The 
comparison graph U(C) then contains exactly r(u + 1) disjoint edges, each 
corresponding to some one comparison in some one bundle. We get our lower 
bound on $ES^{(r)}$ quite easily in Lemma 4, just by considering the contribution 
from configurations of this form. 

There are some configurations C of bundles whose comparison graph U(C) is 
quite complicated; see Guibas and Odlyzko's (1981) discussion of periods in 
strings for a treatment of matching a single sequence to itself. In order to put an 
upper bound on the contribution ER(C) to $ES^{(r)}$ from such a configuration, we 
select a reduced subset $C^* \subset C$, so that $ER(C) \le ER(C^*)$. The reduced configuration 
C* must be chosen small enough so that U(C*) is simple, but large 
enough so that ER(C*) is a useful upper bound. 

We now define the reduction operation, which reduces an r-configuration C to 
the (r - 1)-configuration $C^* = C - \{b_2\}$. A reduction removing $b_2$ from C is 
allowed if C contains two bundles $b_1$ and $b_3$ such that: 

(a) $b_1 \sim_x b_2$ and $b_2 \sim_y b_3$, and 
(b) $C - \{b_2\}$ contains a pair of bundles which is adjacent but not parallel. 

Note that $b_1$ and $b_3$ need not be distinct. A configuration which cannot be 
reduced is called irreducible. 
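
The reduction operation can be written out as a small procedure. This is a sketch under our reading of conditions (a) and (b): a bundle is represented as a pair (i, j), all helper names are hypothetical, and the rule for choosing which removable bundle to delete first (smallest in sorted order) is an arbitrary choice of ours, since the text does not prescribe one.

```python
from itertools import combinations


def x_adj(a, b, u):
    return a != b and abs(a[0] - b[0]) <= u


def y_adj(a, b, u):
    return a != b and abs(a[1] - b[1]) <= u


def adjacent(a, b, u):
    return x_adj(a, b, u) or y_adj(a, b, u)


def parallel(a, b):
    return a[1] - a[0] == b[1] - b[0]


def removable(C, b2, u):
    """True if the reduction removing b2 from configuration C is allowed."""
    # (a): some bundle is x-adjacent to b2 and some bundle is y-adjacent to b2.
    cond_a = any(x_adj(b1, b2, u) for b1 in C) and any(y_adj(b2, b3, u) for b3 in C)
    # (b): what remains still holds an adjacent, non-parallel pair.
    rest = [b for b in C if b != b2]
    cond_b = any(adjacent(a, b, u) and not parallel(a, b)
                 for a, b in combinations(rest, 2))
    return cond_a and cond_b


def is_irreducible(C, u):
    return not any(removable(C, b, u) for b in C)


def reduce_fully(C, u):
    """Apply allowed reductions until none remains (one possible order)."""
    C = set(C)
    while True:
        candidates = sorted(b for b in C if removable(C, b, u))
        if not candidates:
            return C
        C.remove(candidates[0])
```

Different removal orders may end at different irreducible configurations; only the existence of some irreducible C* contained in C is used in the text.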

Not every configuration can be obtained by reduction from a larger configura- 
tion. Neither can a particular configuration be obtained by reduction from very 
many other configurations. These two consequences of the definition are formal- 
ized in the lemma below. 


LEMMA 1. Let C* be a given r-configuration and let t > r. The number of 
t-configurations, which yield C* after some series of t - r reduction operations, 
is less than $[(2u + 1)t]^{2(t - r)}$. If none of the bundles in C* are adjacent, or if all 
bundles in C* are parallel, the number of such t-configurations is zero. 


PROOF. Consider the adjacency graph A(C) of a configuration which is about 
to be reduced by deleting the bundle b = (i, j; u) from the configuration C. From 
condition (a) in the definition of reduction, there must exist bundles $b_1$ and $b_2$ in 
C, such that $b_1 \sim_x b$ and $b_2 \sim_y b$. Hence, at each of the t - r stages of reduction, 
there are fewer than $[(2u + 1)t]^2$ choices for (i, j), these choices being determined 
by the remaining undeleted bundles. Reduction never produces a 
configuration of mutually nonadjacent or mutually parallel bundles because of 
condition (b) in the definition. □ 


The next proposition tells us that the comparison graph of an irreducible 
configuration has a relatively simple structure. 


LEMMA 2. Let r > 1 and $C = \{b_1, \ldots, b_r\}$ be an irreducible r-configuration 
having a connected adjacency graph A(C). Exactly one of the following three 
cases must occur. 

Case 1. No pair of bundles in C is doubly adjacent. In this case, every 
connected component of the comparison graph U(C) is a tree having r or fewer 
edges, and U(C) contains exactly r(u + 1) edges. 




Case 2. C contains a pair of bundles which are doubly adjacent but not 
parallel. In this case, C consists of r = 2 bundles, every connected subgraph of 
U(C) is a simple path, and U(C) contains exactly 2(u + 1) edges. 

Case 3. C contains a pair of overlapping (i.e., doubly adjacent and parallel) 
bundles. In this case, all bundles in C are parallel, every connected subgraph of 
U(C) consists of a single edge, and there are fewer than r(u + 1) edges in U(C). 


PROOF. Case 1. Assume that no two bundles in C are doubly adjacent. It 
follows from this that for each comparison $e \in U(C)$ there is a unique bundle 
$b \in C$ such that $e \in b$, and so U(C) has exactly r(u + 1) edges. 

Assume, in order to obtain a contradiction, that U(C) contains a cycle. Choose 
such a cycle having minimal length $\tau$. Since $U(C) \subset Z^+ \times \{1, 2\}$ is a bipartite 
graph, $\tau$ must be even, with $\tau \ge 4$, and we can label the sites along the cycle as 
$(s(1), 1), (s(2), 2), (s(3), 1), \ldots, (s(\tau), 2)$. Recall that (i, j) is our notation for the 
undirected edge {(i, 1), (j, 2)}. The edges that form this cycle are the comparisons 

$c_k = (s(k), s(k + 1))$ if k is odd, 
$c_k = (s(k + 1), s(k))$ if k is even, 

for k = 1 to $\tau$, with the understanding that $s(\tau + 1) = s(1)$. For k = 1 to $\tau$, let 
$b_k$ be the unique bundle in C such that $c_k \in b_k$. Now $b_1 \sim_y b_2$, $b_2 \sim_x b_3$, and 
$(b_3, b_4)$ is a pair of bundles which are adjacent but not parallel, so that a 
reduction removing $b_2$ is possible. This contradicts the irreducibility of C and 
proves that U(C) contains no cycles. 

Now assume, in order to obtain a contradiction, that U(C) contains a connected 
subgraph G having more than r edges. By the pigeonhole principle, there 
is some pair of edges in G which both belong to the same bundle in C. Since G is 
connected, there is a path of edges in G which begins with one of these two edges 
and ends with the other. Choose such a path in U(C), $c_1, c_2, \ldots, c_\tau$, to have 
minimal length $\tau$. For k = 1 to $\tau$ let $b_k \in C$ with $c_k \in b_k$, so that $b_1 = b_\tau$. It is 
not possible that $\tau = 3$, for then $b_1$ and $b_2$ would be doubly adjacent. The pairs 
$(b_1, b_2)$, $(b_2, b_3)$, and $(b_3, b_4)$ are alternately x-adjacent and y-adjacent. By 
hypothesis $b_3$ and $b_4$ are not doubly adjacent, so they are not parallel. Hence a 
reduction removing $b_2$ from C is possible. This contradicts the irreducibility of C 
and proves that no connected component of U(C) has more than r edges. 

Case 3. Assume that C contains a pair of doubly adjacent, parallel bundles, 
$b_1$ and $b_2$. Any bundle adjacent to $b_1$ must be parallel (and hence doubly 
adjacent) to $b_1$, since otherwise C could be reduced by the reduction removing $b_2$. 
Similarly, any bundle adjacent to $b_2$ must be parallel to $b_2$. Since A(C) is 
connected, it follows by induction that C consists of r mutually parallel bundles. 

All comparisons in U(C) are parallel, so no two distinct comparisons can 
intersect, and hence every connected subgraph of U(C) consists of a single 
comparison. There will be fewer than r(u + 1) comparisons in U(C) because $b_1$ 
and $b_2$ share some comparisons. 

Case 2. At this point, we have disposed of any irreducible C having no 
doubly adjacent pair of bundles in the proof of Case 1. We have disposed of any 
irreducible C having a pair of adjacent and parallel bundles in the proof of Case 
3. Therefore, we may assume that C contains a pair of doubly adjacent bundles 
and that no pair of adjacent bundles in C is parallel. 

Let $b_1$ and $b_2$ be two bundles in C which are doubly adjacent but not parallel. 
C cannot contain a third bundle, adjacent to $b_1$ (and hence not parallel to $b_1$), 
since otherwise C could be reduced by the reduction removing $b_2$. Similarly, C 
cannot contain a third bundle, adjacent to $b_2$. Since A(C) is connected, it follows 
that r = 2 and $C = \{b_1, b_2\}$. 

No comparison can belong to two nonparallel bundles, so U(C) contains 
exactly 2(u + 1) comparisons. Label the two bundles in C as (i, i + k; u) and 
(i', i' + k'; u). A site (s, 1) in U(C) can only share an edge with sites of the form 
(s + k, 2) or (s + k', 2), and a site (s, 2) in U(C) can only share an edge with sites 
of the form (s - k, 1) or (s - k', 1). From this it follows that the only connected 
subgraphs of U(C) are simple paths. □ 


3. Estimates using Jensen's inequality. Recall that the letters 
$X_0, X_1, \ldots, Y_0, Y_1, \ldots \in S$ are assumed to be i.i.d. ($\mu$), for some nontrivial 
distribution $\mu$. Let $p_r$ be the probability that r different letters match, for r = 1, 2, ..., 
so 

(3.1) $p_r = \sum_{a \in S} (\mu_a)^r,$

and $p = p_2$. Jensen's inequality [Hardy, Littlewood and Pólya (1934), formula 
2.10.3] says that for 0 < r < s, $(p_s)^{1/s} < (p_r)^{1/r}$. Thus for s = 3, 4, ..., $p_s < p^{s/2}$. 
Let $\delta = \delta(\mu)$ be defined by $p_3 = p^{3/2} p^{3\delta}$, so that $p^2 = [\sum_a (\mu_a)^2]^2 = 
[E(\mu_X)]^2 \le E[(\mu_X)^2] = \sum_a (\mu_a)^3 = p_3 < p^{3/2}$ implies that $0 < \delta \le 1/6$. Thus 
Jensen's inequality, $p_s \le (p_3)^{s/3}$ for s = 3, 4, ..., yields 

(3.2) $p_s \le p^{s/2} p^{s\delta}$ for s = 3, 4, ....
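
The constants $p_r$ and $\delta$ are easy to compute for a concrete letter distribution, which gives a quick numerical check of (3.2). The sketch below assumes that $\delta$ is defined by $p_3 = p^{3/2 + 3\delta}$, which is how we read the definition preceding (3.2); the function names are ours.

```python
from math import log


def match_prob(mu, r):
    """p_r: probability that r independent letters drawn from mu all agree."""
    return sum(a ** r for a in mu)


def delta(mu):
    """delta solving p_3 = p ** (3/2 + 3 * delta), where p = p_2."""
    p = match_prob(mu, 2)
    p3 = match_prob(mu, 3)
    return log(p3) / (3.0 * log(p)) - 0.5


def bound_3_2_holds(mu, s, rel_tol=1e-9):
    """Check p_s <= p ** (s/2 + s * delta), up to floating-point rounding."""
    p = match_prob(mu, 2)
    d = delta(mu)
    return match_prob(mu, s) <= p ** (s / 2.0 + s * d) * (1.0 + rel_tol)
```

For the uniform distribution on four letters this gives $\delta = 1/6$, the largest value allowed, while unequal letter frequencies give a smaller positive value.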

LEMMA 3. Given a nontrivial distribution $\mu$ on a countable alphabet S, let 
$p \in (0, 1)$ and $\delta \in (0, 1/6]$ be defined as in the discussion preceding (3.2). For 
$r \ge 2$, let C be an irreducible r-configuration having connected adjacency graph 
A(C). The following estimates hold, for k = 0, 1, 2, .... 

Case 1. If no two bundles in C are doubly adjacent, then 

$E\{R_k(C)\} \le p^{-rk} u^{rk} p^{u((r+1)/2 + \delta)}.$

Case 2. If C consists of r = 2 bundles which are doubly adjacent but not 
parallel then 

$E\{R_k(C)\} \le p^{-2k} u^{2k} p^{u} p^{u\delta}.$

Case 3. If all bundles in C are parallel, and $C = \{(i + a_s, j + a_s; u) : s = 1 
\text{ to } r\}$, with $0 = a_1 < a_2 < \cdots < a_r = a$, then 

$E\{R_k(C)\} \le p^{-2k} u^{2k} p^{2u}$ if a > u, 
$E\{R_k(C)\} = 0$ if $a \le u$ and r > k + 1, 

and 

$E\{R_k(C)\} \le u^{k-r+1} a^{k} p^{u+a-2k}$ if $a \le u$ and $r \le k + 1$. 


PROOF. For any bipartite graph $G \subset Z^+ \times \{1, 2\}$, define the event $E_G = 
\{X_i = Y_j$ whenever the sites (i, 1) and (j, 2) are connected in G$\}$. Let t denote a 
connected component of G, and write |t| for the number of vertices in that 
component. Since all letters are i.i.d. ($\mu$), 

(3.3) $P(E_G) = \prod_t p_{|t|}.$
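
Formula (3.3) says that the probability of $E_G$ depends on G only through the sizes of its connected components. The sketch below computes those sizes with a small union-find and compares the product formula with direct enumeration over all letter assignments; the function names and the labelling of sites as `('x', i)` and `('y', j)` are our own conventions.

```python
from itertools import product


def component_sizes(edges):
    """Sorted sizes |t| of the connected components of the comparison
    graph whose edges (i, j) join x-site i to y-site j."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i, j in edges:
        parent[find(('x', i))] = find(('y', j))
    sizes = {}
    for v in list(parent):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values())


def prob_all_match(edges, mu):
    """Right-hand side of (3.3): product of p_{|t|} over components t."""
    prob = 1.0
    for size in component_sizes(edges):
        prob *= sum(a ** size for a in mu)
    return prob


def prob_by_enumeration(edges, mu):
    """P(E_G) by summing over every assignment of letters to the sites."""
    sites = sorted({('x', i) for i, _ in edges} | {('y', j) for _, j in edges})
    total = 0.0
    for letters in product(range(len(mu)), repeat=len(sites)):
        assign = dict(zip(sites, letters))
        if all(assign[('x', i)] == assign[('y', j)] for i, j in edges):
            weight = 1.0
            for a in letters:
                weight *= mu[a]
            total += weight
    return total
```

The enumeration is exponential in the number of sites and is meant only as a check on very small graphs.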


Cases 1 and 2. Lemma 2 says that U(C) has exactly r(u + 1) edges, and each 
connected subgraph of U(C) is a tree. 

Choose and fix, in each of the r bundles, some k comparisons after the first 
comparison. There are $\binom{u}{k}^r \le u^{rk}$ ways to do this. Delete these edges, together 
with the first comparison from each of the r bundles in C, from the graph U(C), 
to form a subgraph G which has exactly r(u - k) edges. In order for the 
indicator $R_k(C)$ to be 1, there must be some such choice of G, with $X_i = Y_j$ 
whenever the sites (i, 1) and (j, 2) are connected in G, hence 

(3.4) $E\{R_k(C)\} \le \sum_G P(E_G) \le u^{rk} \max_G P(E_G).$


Let t be a connected component of G, so that t is a tree with |t| vertices and 
|t| - 1 edges. Let 

$\tau = \sum_{t : |t| \ge 3} 1$ and $\theta = \sum_{t : |t| \ge 3} (|t| - 1),$

so that $\tau$ is the number of connected components which have 2 or more edges, 
and $\theta$ is the total number of edges contained in these $\tau$ trees. Since G has 
r(u - k) edges, G has exactly $r(u - k) - \theta$ components t having |t| = 2. From 
(3.2) and (3.3), 

(3.5) $P(E_G) = \prod_t p_{|t|} \le p^{r(u-k) - \theta}\, p^{(\theta + \tau)/2}\, p^{(\theta + \tau)\delta}.$

The first factor above comes from trees consisting of a single comparison. The 
last two terms come from inequality (3.2), applied to the $\tau$ trees consisting of two or 
more comparisons, with a total of $\theta$ edges connecting $\theta + \tau$ sites. 

In Case 1, every connected component of U(C) has at most r edges. Thus 
every component t of G has at most r edges, so $\tau \ge \theta/r$. Substitute this into 
(3.5) to get the first of the two inequalities below: 

(3.6) $P(E_G) \le p^{r(u-k) - \theta}\, p^{(\theta + \theta/r)/2}\, p^{(\theta + \theta/r)\delta} \le p^{-rk}\, p^{u((r+1)/2 + \delta)}.$

The second inequality is valid because $\theta \in [0, ru]$. To check this inequality, take 
logarithms to get an inequality which is linear in $\theta$, and thus only needs to be 
checked at the endpoints. At the endpoint $\theta = 0$, we use $\delta \le 1/2$, and at the 
endpoint $\theta = ru$, we use $\delta \ge 0$. Combining (3.4) and (3.6) establishes Case 1. 


In Case 2, substitute r = 2 and $\tau \ge 0$ into (3.5) to get the first of the 
inequalities below: 

(3.7) $P(E_G) \le p^{2(u-k) - \theta}\, p^{\theta/2}\, p^{\theta\delta} \le p^{-2k}\, p^{u}\, p^{u\delta}.$

As in Case 1, the last inequality is valid because $\theta \in [0, 2u]$; it checks at $\theta = 0$ 
using $\delta \le 1$, and it checks at $\theta = 2u$ using $\delta \ge 0$. Combining (3.4) and (3.7) 
establishes Case 2. 

Case 3. If a > u, then the bundles $b_1$ and $b_r$ are not adjacent, so $R_k(b_1)$ 
and $R_k(b_r)$ are independent. Hence $ER_k(C) \le ER_k(\{b_1, b_r\}) = [ER_k(b_1)]^2 \le 
[u^k p^{u-k}]^2$ and the bound is established. 

If r > k + 1 and $a \le u$, then $R_k(C)$ is identically zero, because to have 
$R_k(C) = 1$ requires that there are nonmatches at $(i + a_s, j + a_s)$ for s = 2 to r, 
which forces $r - 1 \ge k + 1$ nonmatches between $X_{i+1} \cdots X_{i+u}$ and $Y_{j+1} \cdots Y_{j+u}$. 

Finally, consider the case $r \le k + 1$ and $a \le u$. To have k-runs witnessed at 
bundles $b_1, \ldots, b_r$, there must be nonmatches at comparisons (i + s, j + s) for 
$s \in \{a_2, \ldots, a_r\}$. Thus to have a k-run at $b_1$, there can be at most k - (r - 1) 
nonmatches at comparisons (i + s, j + s) for $s \in \{1, 2, \ldots, u\} - \{a_2, \ldots, a_r\}$. To 
have a k-run at $b_r$, there can be at most k nonmatches at comparisons (i + s, 
j + s) for $s \in \{u + 1, \ldots, u + a\}$. Choose a subset of size k - r + 1 from 
$\{1, 2, \ldots, u\} - \{a_2, \ldots, a_r\}$, and a subset of size k from $\{u + 1, \ldots, u + a\}$, 
and require matches at (i + s, j + s) for the u + a - 2k values s in 
$\{1, 2, \ldots, u + a\}$ that lie outside $\{a_2, \ldots, a_r\}$ and outside the two chosen subsets. 
The two subsets can be chosen in $\le (u - r + 1)^{k-r+1} a^k \le u^{k-r+1} a^k$ ways. Thus 

$E\{R_k(C)\} \le u^{k-r+1} a^{k} p^{u+a-2k}.$ □ 


4. Approximate distribution of the longest match run. In this section 
we compute the approximate distribution of the longest k-interrupted run of 
matches. We combine the combinatorial work on reduction of Section 2 and the 
probability bounds of Section 3 with Watson's (1954) formulation of the 
inclusion-exclusion principle to obtain our main result, Theorem 1. Using (1.4) and 
(1.6), we need only show, in an appropriate sense, that $E\{S_k^{(r)}(m, n; u)\}$ converges 
to $[E\{S_k^{(1)}(m, n; u)\}]^r / r!$. We obtain the uniform lower and upper bounds 
for this in Lemmas 4 and 5, respectively. 

Throughout this section, we adopt the convention that $m \le n$. We write 
q = 1 - p, where p, the probability of a match on a given comparison, is defined 
in (0.1). Let $G_k(u) = E\{R_k((0, 0; u))\}/q$ = the probability that there are no 
more than k nonmatches among the u comparisons (1, 1), (2, 2), ..., (u, u). Since 
$G_k(u)$ is the probability of at least u - k "failures" (matches, each with probability 
p) before the (k + 1)th "success" (nonmatches, each with probability q), 
$G_k(u)$ is the upper tail probability of a negative binomial distribution. It is easily 
seen that 

(4.1) $G_k(u) = \sum_{t \ge u} \binom{t}{k} p^{t-k} q^{k+1} = q^{k+1} (d/dp)^k \{p^u / (1 - p)\} / k!$

for integers $u \ge k$. We use the final expression above to define $G_k(u)$ for all real 
u, and note that 

(4.2) $G_k(u) \sim (qu/p)^k p^u / k!$ as $u \to \infty$.
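
For integer u, the three descriptions of $G_k(u)$ can be compared numerically. The sketch below is illustrative: `G_exact` uses the verbal definition (at most k nonmatches among u comparisons), `G_series` truncates the negative binomial sum as we read it in (4.1), and `G_asymptotic` is the right-hand side of (4.2); all three names and the truncation length `terms` are our own.

```python
from math import comb, factorial


def G_exact(u, k, p):
    """P(at most k nonmatches among u independent comparisons)."""
    q = 1.0 - p
    return sum(comb(u, s) * q ** s * p ** (u - s) for s in range(k + 1))


def G_series(u, k, p, terms=2000):
    """Sum over t >= u of comb(t, k) p^(t-k) q^(k+1), truncated."""
    q = 1.0 - p
    return sum(comb(t, k) * p ** (t - k) * q ** (k + 1)
               for t in range(u, u + terms))


def G_asymptotic(u, k, p):
    """Leading-order approximation (qu/p)^k p^u / k! from (4.2)."""
    q = 1.0 - p
    return (q * u / p) ** k * p ** u / factorial(k)
```

For k = 0 all three reduce to $p^u$; for k = 1 the ratio of the exact value to the asymptotic one is 1 + p/(uq), which shows how slowly the approximation in (4.2) takes hold.
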
LEMMA 4. If 2u > r and $n \ge m > 4ru$, then 

$E\{S_k^{(r)}(m, n; u)\} \ge [1 - 2ru(m + n)/(mn)]^r [qmnG_k(u)]^r / r!.$


PROOF. $S_k^{(r)}(m, n; u)$ was defined in (1.3). We can bound $E\{S_k^{(r)}(m, n; u)\}$ 
below by only summing over those configurations C consisting of r mutually 
nonadjacent bundles in B(m, n; u), the set of bundles whose initial comparison 
connects one of the first m x-sites with one of the first n y-sites. For these C, 
$R_k(C)$ is the product of r independent indicator random variables, each 
with expectation $qG_k(u)$, so $ER_k(C) = (qG_k(u))^r$. In choosing such a configuration 
C one bundle at a time, the added bundle can start at any site at least 
u sites away from any of the previously chosen bundles, allowing at least 
$[m - (r - 1)(2u + 1)][n - (r - 1)(2u + 1)] \ge mn - 2ur(m + n)$ choices for the 
additional bundle. Hence there are at least $[mn - 2ur(m + n)]^r / r!$ such 
r-configurations C. □ 


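LEMMA 5. Choose and fix $k \ge 0$, $r \ge 1$, and $\epsilon > 0$. Let m, $n \to \infty$ in such a 
manner that $\log(m)/\log(n) \to 1$. Let 

(4.3) $U(mn) = \{u : mn\, u^k p^u \in [\epsilon, \epsilon^{-1}]\}.$

Then 

$\sup_{u \in U(mn)} \bigl| \{(qmnG_k(u))^r / r!\}^{-1} E\{S_k^{(r)}(m, n; u)\} - 1 \bigr| \to 0.$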


PROOF. The crucial observation is that the random variable $R_k(C)$ is a 
product of mutually independent indicators which correspond to connected 
components in the adjacency graph A(C). Further, the reduction operations let 
us study components whose graphs in U(C) are of particularly simple form. 

We assume that $m \le n$ and that $u \ge 1$. We write $\eta_1, \eta_2, \ldots$ for constants 
which depend on p, k, $\epsilon$, and r, but not on m, n, or u, and whose exact value is 
not of interest. 

Consider a particular r-configuration C. If C is reducible, then it may be 
reduced to some irreducible configuration $C^* \subset C$; if C is irreducible, we take 
C* = C. Let $C_1, \ldots, C_\tau$ be the connected components of A(C*), so that the 
adjacency graphs $A(C_s)$ are individually connected, but no pair of comparison 
graphs $U(C_s)$ and $U(C_t)$ share any sites. Hence the indicators $R_k(C_s)$, s = 1 to $\tau$, 
are functions of disjoint sets of independent random variables, and so are 
independent. Thus 

$ER_k(C) \le ER_k(C^*) = \prod_{1 \le s \le \tau} ER_k(C_s).$

Since C* is irreducible, so is each $C_s$. To make use of Lemma 1, let $\theta(C^*) = 1$ if 
|C*| = r and $\theta(C^*) = \eta u^{2r}$, with $\eta = (4r)^{2r}$, if |C*| < r. Write $S_r$ for 
$S_k^{(r)}(m, n; u)$. We have 

(4.4) $ES_r \le \sum_{\{C^* \text{ irreducible}\}} \theta(C^*) \prod_{s \le \tau} E\{R_k(C_s)\}.$

The sum is taken over all irreducible configurations C* which are r-configurations, 
or which can be obtained by reduction from an r-configuration. The 
product is taken over the $\tau$ subsets $C_s$ of C* which correspond to connected 
components in the adjacency graph A(C*). 

Note that if $\theta(C^*) > 1$, then C* was obtained from an r-configuration by 
reduction, so there exists some pair of bundles in C* which are adjacent but not 
parallel, and hence at least one of the $C_s$ belongs to Cases 1 or 2 in Lemma 2. For 
j = 1, 2, 3 let $\sigma_j$ be the sum of $ER_k(C^*)$ over all $\rho$-configurations C* belonging to 
Case j of Lemma 2, with C* irreducible, A(C*) connected, and $2 \le \rho \le r$. 
Proceeding from (4.4), and assuming $u \in U$ in (4.3), we have 

(4.5) $ES_r \le \{qmnG_k(u) + r!\,\eta_0 [\eta u^{2r}(\sigma_1 + \sigma_2) + \sigma_3]\}^r / r!.$

From (4.2) and (4.3), $qmnG_k(u)$ is bounded away from zero and infinity. The 
factor $\eta_0$ appears in (4.5) with each $\sigma_j$ to compensate for possibly spurious 
multiplication by factors $qmnG_k(u)$ when taking the rth power. Because 
$u \in U(mn)$, $\eta_0 \le \epsilon^{-r}$. The multiplication within braces by r! cancels the effect of 
division by the final r!. It is used to count r-configurations in which no two 
bundles are adjacent. The factor $\eta u^{2r}$ which multiplies $\sigma_1$ and $\sigma_2$ corresponds to 
those C* with $\theta(C^*) > 1$, and is obtained from Lemma 1. 

We now analyze the contributions to (4.5) of the sums $\sigma_1$, $\sigma_2$, and $\sigma_3$. 

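Case 1. Because we have assumed that $m \le n$, there are at most 
$\eta_1 mn([m + n][2u + 1])^{\rho - 1} \le \eta_2 m n^{\rho} u^{\rho}$ $\rho$-configurations $C_s$, with $\rho \le r$, in which 
no two bundles are doubly adjacent, but $A(C_s)$ is connected. Using Lemma 3 for 
the first inequality and (4.3) for the second inequality, we get 

$\eta u^{2r} \sigma_1 \le \sum_{\rho \le r} \eta_3\, m n^{\rho} u^{\rho}\, u^{2r + \rho k}\, p^{u((\rho+1)/2 + \delta)}$

$\le \sum_{\rho \le r} \eta_4\, m n^{\rho}\, u^{2r + \rho + \rho k}\, (mn)^{-((\rho+1)/2 + \delta)}$

$= \sum_{\rho \le r} \eta_4\, (n/m)^{(\rho - 1)/2} (mn)^{-\delta}\, u^{2r + \rho + \rho k}.$

Since $\log(n)/\log(m) \to 1$, it follows that $(n/m)^{(\rho - 1)/2} (mn)^{-\delta/2} \to 0$. Condition (4.3) 
implies that $u \sim \log_{1/p}(mn)$, hence $u^{2r + \rho + \rho k} (mn)^{-\delta/2} \to 0$, and so $u^{2r}\sigma_1 \to 0$, 
uniformly over $u \in U$. 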
Case 2. There are fewer than $mn(2u + 1)^2$ configurations consisting of two 
bundles which are doubly adjacent but not parallel. Using this and Lemma 3, 
$\sigma_2 \le \eta_1 mn\, u^2 u^{2k} p^{u + u\delta}$. We now employ (4.3) to obtain $u^{2r}\sigma_2 \le \eta_2 u^{2 + 2r + k} p^{u\delta} \to 
0$, uniformly over $u \in U(mn)$. 

Case 3. Let $C^* = \{(i + a_s, j + a_s; u) : s = 1, 2, \ldots, \rho\}$ with $2 \le \rho \le r$ and 
$0 = a_1 < a_2 < \cdots < a_\rho = a$. Assume also that A(C*) is connected. Because 
A(C*) is connected, there are fewer than $mn[\min(a, u)]^{\rho - 1}$ such $\rho$-configurations. 
Use Lemma 3 for the first line below, where the first sum corresponds in 
Lemma 3 to the sub-case r > k + 1 or a > u, and the second term corresponds to 
the sub-case $r \le k + 1$ and $a \le u$, and then use (4.3), 

$\sigma_3 \le \eta_1 \Bigl\{ \sum_{2 \le \rho \le r} mn\, u^{\rho - 1} u^{2k} p^{2u} + \sum_{2 \le \rho \le r} \sum_{1 \le a \le u} mn\, a^{\rho - 1 + k} u^{k - \rho + 1} p^{u + a} \Bigr\}$

$\le \eta_2 \Bigl\{ \sum_{2 \le \rho \le r} u^{k + \rho - 1} p^{u} + \sum_{2 \le \rho \le r} \sum_{1 \le a} a^{\rho - 1 + k} u^{-\rho + 1} p^{a} \Bigr\}$

$\le \eta_3 [u^{k + r - 1} p^{u} + u^{-1}] \to 0.$

Combining the results for Cases 1, 2, and 3, it follows that the upper bound 
(4.5) is of form $\{qmnG_k(u) + o(1)\}^r / r!$, uniformly over the set U in (4.3). This 
upper bound, together with the lower bound from Lemma 4, establishes Lemma 5. □ 


It is now an easy matter to prove our main theorem about large sample
approximations to the distribution of $M_k(m, n)$. We write

(4.6)    $v_k(n) = \log_{1/p}(n) + k \log_{1/p}\log_{1/p}(n) + k \log_{1/p}(q/p) - \log_{1/p}(k!)$.

The solution of the equation $nG_k(v) = 1$ is almost $v_k(n)$. Because of (4.2),
$nG_k(v_k(n)) \to 1$ as $n \to \infty$. We write $\lfloor \cdot \rfloor$ for the greatest integer function, and
$\beta(x) = x - \lfloor x \rfloor$.
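As an illustrative sketch only, the centering constant of (4.6) can be evaluated numerically as follows. The function name `v_k` and its argument order are hypothetical, and the sketch assumes $q = 1 - p$, which is how $(1 - p)$ enters the calculations of Sections 5 and 7.

```python
import math


def v_k(n, p, k=0):
    """Centering constant of (4.6), with all logarithms taken to base 1/p.

    Assumption: q = 1 - p (not restated in this section of the paper).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if n <= 1:
        raise ValueError("n must exceed 1 so that log log n is defined")
    if k < 0:
        raise ValueError("k must be a nonnegative integer")
    q = 1.0 - p
    log_base = math.log(1.0 / p)

    def log_b(x):
        # logarithm to base 1/p
        return math.log(x) / log_base

    return (log_b(n)
            + k * log_b(log_b(n))
            + k * log_b(q / p)
            - log_b(math.factorial(k)))


if __name__ == "__main__":
    # With k = 0 the constant reduces to log_{1/p}(n).
    print(v_k(4 ** 10, 0.25, 0))
    print(v_k(4 ** 10, 0.25, 3))
```

For $k = 0$ the last three terms vanish, so the function returns $\log_{1/p}(n)$ alone; the correction terms only matter for interrupted matches.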


THEOREM 1. Let $m, n \to \infty$ in such a way that $\log(m)/\log(n) \to 1$. Write
$\lambda = \lambda(k, m, n; u) = qnmG_k(u)$. Then

(4.7)    $\sup_u |P\{M_k(m, n) \ge u\} - \{1 - \exp(-\lambda)\}| \to 0$.

In addition, for every fixed $t$, and $j = 0, 1, 2, \ldots$,

(4.8)    $\sup_{\{u \in \mathbb{Z}:\ |u - v_k(mn)| \le t\}} |P\{N_k(m, n; u) = j\} - \exp(-\lambda)\lambda^j/j!| \to 0$.


PROOF. First fix $t$, and note that $\{mnu^kp^u: |u - v_k(mn)| \le t\}$ is bounded
away from zero and infinity. Thus for each $r = 0, 1, \ldots$, the expectations of (1.4)
and (1.5) can be evaluated in the limit as $m, n \to \infty$, uniformly over integers $u$
such that $|u - v_k(mn)| \le t$, by means of Lemma 5. By taking $r$ sufficiently large,
(4.8) is proved. For the special case $j = 0$ of (4.8), both $P\{M_k(m, n) \ge u\}$ and
$\{1 - \exp[-\lambda]\}$ tend to 0 [or 1] as $u - v_k(mn)$ tends to $+\infty$ [or $-\infty$]. This
establishes (4.7). □


SEQUENCE MATCHING 985 


The next theorem says that the locations $(i, j)$ along the two sequences, at
which long matches occur, are distributed approximately independently and
uniformly throughout their possible range $[1, m] \times [1, n]$. Formally, we define
the random measures $\xi = \xi_k(m, n; u)$ on $[0,1]^2$ by

$$\xi = \sum_{i \le m,\ j \le n} \delta_{i/m,\, j/n}\, R_k(i, j; u),$$

where $\delta_{x,y}$ denotes unit mass at the point $(x, y)$. The total mass of $\xi$ is
$\xi([0,1]^2) = N_k(m, n; u)$, which by Theorem 1 is approximately Poisson in distribution, with parameter $\lambda$. Theorem 1′ says that the point processes $\xi$ are
uniformly close to the corresponding constant intensity Poisson processes on
$[0,1]^2$.


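THEOREM 1′. Let $B_1, \ldots, B_d$ be disjoint rectangles in $[0,1]^2$. Let $a_i$ be the
area of $B_i$ and let $\lambda = qnmG_k(u)$. Fix $t$ and let $j_1, \ldots, j_d$ be nonnegative integers.
If $m$ and $n \to \infty$ with $\log(m)/\log(n) \to 1$, then

$$\sup_{u:\ |u - v_k(mn)| \le t} \Bigl| P(\xi(B_i) = j_i \text{ for } i = 1 \text{ to } d) - \prod_{i=1}^{d} \exp(-\lambda a_i)(\lambda a_i)^{j_i}/j_i! \Bigr| \to 0.$$
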
PROOF. For each $i$, a formula like (1.5) gives upper and lower bounds on the
indicator function $1(\xi(B_i) = j_i)$ in terms of sums of indicators $R_k(C)$ with index
sets $C$ of size $|C| \le 27$. After multiplying out, this gives upper and lower bounds
on $1(\xi(B_i) = j_i$ for $i = 1$ to $d)$ in terms of sums of indicators $R_k(C)$ with index
sets $C$ of size $|C| \le 27d$. For the expectations of these bounds, Lemma 5 and the
proof of Lemma 4, applied with $r = 1, \ldots, 27d$, show that the combined contribution from those $C$ corresponding to dependent events is uniformly negligible, and
that the expectations for these bounds are uniformly close to what they would be
if all the indicators $R_k(i, j; u)$ were mutually independent. By taking $r$ sufficiently large, the theorem is proved. □


In Theorem 2, we establish the most useful results about the approximate
distribution of $M_k(m, n)$. Although the main result is a consequence of Theorem
1 of Ferguson (1984), we have chosen to follow the constructive approach of
Gordon, Schilling and Waterman (1986) for three reasons. First, that approach
makes clear the choice of centering constant $v_k(\cdot)$. Second, the representation of
assertion (c) makes clear the relationship of the various limiting distributions
along the appropriate subsequences. Third, the representation (c) and its use of
$\beta(x) = x - \lfloor x \rfloor$ suggests the Fourier methods which enable us to compute the
expectations in assertion (e).

Before stating Theorem 2, we note as in Gordon, Schilling and Waterman
(1986) that $(d/du)G_k(u)$ is the product of $p^u$ times a polynomial in $u$ whose
leading coefficient is negative. Hence, for fixed $p$, there exists some constant $u_0$
such that $G_k(u)$ is continuous and strictly decreasing for all real $u \ge u_0$.

We denote by $Z_1, Z_2, \ldots$ an i.i.d. sequence of absolutely continuous random
variables such that $P\{Z_1 \le u\} = 1 - G_k(u)$ for all real $u \ge u_0$. We denote by $W$
a standard extreme value random variable having cumulative distribution function $\exp(-e^{-w})$. Recall that $W$ has expectation $\gamma$, the Euler-Mascheroni constant $0.577\ldots$, and has variance $\pi^2/6$. We write $l = \log(1/p)$. The bounds on the
remainder terms of assertions (e) and (f) are very similar to those obtained in
Boyd (1972).


THEOREM 2. Let $m, n \to \infty$ in such a manner that $\log(n)/\log(m) \to 1$. We
may conclude that:

(a) $P\{M_k(m, n) > u\} - P\{\lfloor \max\{Z_s: s \le qmn\} \rfloor > u\} \to 0$, uniformly in $u$,
for all real $u$.

(b) $P\{M_k(m, n) > u\} - P\{\lfloor W/l + v_k(qnm) \rfloor > u\} \to 0$, uniformly in $u$, for
all real $u$.

(c) $P\{M_k(m, n) > u + v_k(qnm)\} - P\{\lfloor W/l + \beta(v_k(qmn)) \rfloor - \beta(v_k(qmn)) > u\} \to 0$, uniformly in $u$, for all real $u$.

(d) The random variables $\{M_k(m, n) - v_k(qnm)\}^2$ are uniformly integrable.

(e) $E\{M_k(m, n)\} = v_k(qnm) + \gamma/l - 1/2 + r_1(m, n) + o(1)$, where $\theta = \pi^2/l$
and $|r_1(s)| \le (2\pi)^{-1}\theta^{1/2}e^{-\theta}(1 - e^{-\theta})^{-2}$.

(f) $\mathrm{Var}\{M_k(m, n)\} = \pi^2/(6l^2) + 1/12 + r_2(m, n) + o(1)$, where $|r_2(s)| < 2(1.1 + 0.7\theta)\theta^{1/2}e^{-\theta}(1 - e^{-\theta})^{-3}$.
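The leading terms of assertions (e) and (f) are easy to evaluate. The sketch below is illustrative only: the function name `moment_approximations` is hypothetical, it takes $q = 1 - p$ as an assumption, and it drops the oscillating remainders $r_1$, $r_2$ and the $o(1)$ terms, so its output is the smooth part of the approximation and nothing more.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def moment_approximations(m, n, p, k=0):
    """Leading terms of Theorem 2(e) and 2(f) for the longest match length.

    Returns (mean, variance) with the remainder terms r_1, r_2 and o(1)
    omitted. Assumption: q = 1 - p.
    """
    q = 1.0 - p
    l = math.log(1.0 / p)
    s = q * m * n
    if s <= 1.0:
        raise ValueError("q*m*n must exceed 1")

    def log_b(x):
        # logarithm to base 1/p
        return math.log(x) / l

    # centering constant v_k(qnm) of (4.6)
    v = (log_b(s) + k * log_b(log_b(s)) + k * log_b(q / p)
         - log_b(math.factorial(k)))
    mean = v + EULER_GAMMA / l - 0.5
    variance = math.pi ** 2 / (6.0 * l * l) + 1.0 / 12.0
    return mean, variance


if __name__ == "__main__":
    print(moment_approximations(128, 128, 0.25, k=0))
```

Note that the variance approximation does not depend on $m$, $n$, or $k$; only the centering does.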


PROOF. Assertion (a) follows as an immediate consequence of Watson (1954).
The random variables $Z_s$ are independent, and $sG_k(v_k(s) + u) \to p^u$ for all real $u$
as $s \to \infty$. Let $L_s = \max\{Z_j: j \le s\}$. We conclude from Watson (1954) that
$L_s - v_k(s)$ converges in distribution to $W/l$ and that $P\{L_{qnm} < t\}$ is approximated by $\exp(-qnmG_k(t))$ uniformly for all integers $t$.

Fix $\tau > 0$. Let $\{u_s\}$ be any sequence of integers for which $|u_{qnm} - v_k(qnm)| < \tau$.
Theorem 1 tells us that $P\{M_k(m, n) \ge u_{qnm}\}$ is approximated by $1 - \exp(-qnmG_k(u_{qnm}))$, uniformly over such sequences $\{u_s\}$. For any integer $t$,
$P\{L_{qnm} \ge t\} = P\{\lfloor L_{qnm} \rfloor \ge t\}$. Hence, $P\{M_k(m, n) \ge t\} - P\{\lfloor L_{qnm} \rfloor \ge t\} \to 0$
uniformly in integers $t$ for which $|t - v_k(qnm)| < \tau$. Because both $M_k(m, n)$ and $\lfloor L_s \rfloor$ have
as support the integers, it follows that $P\{M_k(m, n) > u\} - P\{\lfloor L_{qnm} \rfloor > u\} \to 0$
uniformly over all real $u$.

Assertions (b) and (c) follow from (a) because $L_s - v_k(s)$ converges in distribution to $W/l$.

We next prove (d), which asserts the uniform integrability of the random
variables $\{M_k(m, n) - v_k(qnm)\}^2$. In order to simplify notation, we write $\eta, \eta_1, \eta_2, \ldots$ for
constants whose exact values are not material to the proof. We also write $v$ for
$v_k(qnm)$.

Because the function $s^kp^s$ is bounded, we obtain as a consequence of (4.2) that

(4.9)    $P\{M_k(m, n) \ge v + t\} \le nmG_k(v + t) \le \eta t^kp^t$ for any $t \ge 1$.


Choose ¢ to be a constant between 0 and 1, which we will specify later. We now 
use Chebyshev’s inequality to bound P{M,(m, n) < v — t} for 1 < t < (1 — §)o. 
Note that P{M,(m, n) < v — t} = P{S{(m, n; v — t) = 0} and that the latter 
probability involves a sum of indicator random variables most pairs of which are 




independent. We assume without loss of generality that m < n. An argument 
identical to that in the proof of Lemma 5 shows that 
Var{S{(m, n; v — t)} < nmG,(o — t) + (mnvo, + nmvo, + nmoz), 


where the first term is attributable to the variances of nm Bernoulli random 
variables, and the terms o,, 0, and o, are contributions of products of indicators. 
These products correspond to Cases 1, 2, 3 of Lemmas 2 and 3, as applied to 
2-configurations which are, respectively, singly adjacent, doubly adjacent but not 
parallel, and both parallel and doubly adjacent. 

From Lemma 3, these terms may be bounded: 


os v2*p@/2+8Ke-2 


< nat (mn) $t? p762 
Oy $ nalo ~ t) p0 +at-0 
< myoH(mn) 0” p0, 
and 
oz < q0™?(mn) >. 
z hypothesis, E{S{(m, n; v — t)} = nmG,(v — t) > np~'. Hence, because 
i 
(4.10) P{M,(m, n) <v — t} < np“ forl<t<(1-$)o. 
Now break the sets {X,} and {Y;} into m/(Sv) disjoint pieces to obtain: 
P(M,(m, n) < Sv} < (1 = G40)" 
(4.11) < exp(—mG,($0)/[$0]) 
< exp(—mo~'(nm)*/n) < exp(—m="/n). 
The final inequality follows because log(n)/log(m) —> 1. Now choose and fix 
= 4. Combining (4.9), (4.10), and (4.11) gives 
E{|M,(m, n) — v| 1(|M,(m, n) — v| > r)} < ppm /9 + Ze 


and so we have established the uniform integrability of |M,(m, n) — v,(qnm)|. 

Assertion (e) follows exactly as in Gordon, Schilling and Waterman (1986), 
because assertion (d) lets us approximate the mean and variance of M,(m, n) by 
the corresponding moments of 7 + | W/L -— n]. 0 


5. The relative growth of the two sequence lengths must be constrained. Here is a calculation that shows why the hypothesis "$\log(m)/\log(n) \to 1$" is needed to state Theorems 1 and 2 in a form with minimal conditions on
$\mu$. This calculation shows that if $(\log m)/(\log n)$ does not converge to 1, then a
suitable choice of $\mu$ makes the second moment of $N = N(m, n; u)$ blow up.
$N_k(m, n; u)$, the count of witnesses to matches having length at least $u$, was
defined in (1.2).




Fix $\theta \in (0,1)$ and take $m, n \to \infty$ with $\log(m)/\log(n) \to \theta$. Consider only
configurations $C$ of the form $\{(i, r; u), (i, s; u)\}$ with $r \ne s$, $i \le m$, and $r, s \le n$.
The number of these configurations is $\binom{n}{2}m$. Denote by $\lceil \cdot \rceil$ the least integer
function. Take $u = \lceil \log_{1/p}((1 - p)mn) \rceil$ so that $c = u - \log_{1/p}((1 - p)mn) \in [0,1)$ and $EN = p^c \in (p, 1]$. Note that $\log_{1/p}(\binom{n}{2}m) \sim \log_{1/p}(mn^2) \sim (\log_{1/p}(mn))(2 + \theta)/(1 + \theta) \sim u(2 + \theta)/(1 + \theta)$. We have $R(C) = 1(X_i \ne Y_r,\ X_i \ne Y_s,\ X_{i+t} = Y_{r+t} = Y_{s+t}$ for $t = 1$ to $u)$, and $ER(C) = (\sum_{a \in S}\mu_a(1 - \mu_a)^2) \times (p_3)^u$, where $p_3 = \sum_{a \in S}(\mu_a)^3$. Thus $\log(ER(C)) \sim u\log(p_3)$. The
combined contribution to $ES^{(2)}$ from these $\binom{n}{2}m$ configurations is $\binom{n}{2}mER(C)$
with $\log_{1/p}(\binom{n}{2}mER(C)) \sim u[(2 + \theta)/(1 + \theta) - \log_p(p_3)]$. Jensen's inequality
says that $p_3 \le p^{3/2}$.

Given any $\theta < 1$, if $\log_p(p_3)$ is sufficiently close to 3/2, then the coefficient of $u$
in the expression for $\log_{1/p}(\binom{n}{2}mER(C))$ is positive. In that case $ES^{(2)} \to \infty$.

For fixed $p \in (0,1)$, there are examples of $\mu$ having $\sum_{a \in S}(\mu_a)^2 = p$ and
$\log_p(p_3)$ arbitrarily close to Jensen's bound, 3/2, although the alphabet $S$ may also
have to be large. However, for any particular $\mu$, the condition "$(\log m)/(\log n) \to 1$" might not be necessary for the conclusion of Theorems 1 and 2 to hold.

Consider the above argument in more detail. Let $\theta_0 = [2 - \log_p(p_3)]/[\log_p(p_3) - 1]$. Elementary manipulation shows that $\theta_0 \ge 0$, with
equality iff $\mu$ is uniform. In the case $\mu$ is nonuniform, and $m$ and $n \to \infty$ with
$(\log m)/(\log n) \to \theta < \theta_0$, the above argument shows that the number $N$ of
witnesses to a match of length at least $c + \log_{1/p}((1 - p)mn)$ is bounded in $L^1$
and unbounded in $L^2$; however, this argument fails to resolve whether the family
of random variables $\{M(m, n) - \log_{1/p}(mn)\}$ is tight. We conjecture that for
each $\mu$ there exist centering constants $u(m, n)$, namely $u(m, n) = E(M(m, n))$,
such that the family $\{M(m, n) - u(m, n): m, n \in \mathbb{Z}^+\}$ is tight.
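The threshold in this argument depends on the letter distribution only through $p = \sum_a \mu_a^2$ and $p_3 = \sum_a \mu_a^3$, so it is simple to compute. The following is a hedged sketch; the function name `theta_zero` and the example distributions are hypothetical and do not come from the paper.

```python
import math


def theta_zero(mu):
    """Return (p, p3, log_p(p3), theta_0) for a letter distribution mu.

    theta_0 = [2 - log_p(p3)] / [log_p(p3) - 1], with p the sum of squares
    and p3 the sum of cubes of the letter probabilities.
    """
    if any(x < 0 for x in mu) or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("mu must be a probability vector")
    p = sum(x * x for x in mu)
    if p >= 1.0:
        raise ValueError("mu must not be a point mass")
    p3 = sum(x ** 3 for x in mu)
    log_p_p3 = math.log(p3) / math.log(p)
    return p, p3, log_p_p3, (2.0 - log_p_p3) / (log_p_p3 - 1.0)


if __name__ == "__main__":
    print(theta_zero([0.25, 0.25, 0.25, 0.25]))  # uniform: theta_0 is 0
    print(theta_zero([0.7, 0.1, 0.1, 0.1]))      # nonuniform: theta_0 > 0
```

For a uniform distribution $\log_p(p_3) = 2$ and the threshold is 0; the more mass is concentrated on one letter, the closer $\log_p(p_3)$ gets to 3/2 and the closer the threshold gets to 1.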

In this paper, we have analyzed $M(m, n)$ by the position at which
the match appears. In contrast, an analysis of $M(m, n)$ by the pattern
which is matched is carried out in Arratia and Waterman (1985b). In order
to describe the result of that analysis by pattern, consider the distribution
$\alpha$ defined by $\alpha_a = P(X_1 = a \mid X_1 = Y_1) = (\mu_a)^2/p$, and note that $\mu = \alpha$ iff $\mu$
is uniform. Let $H(\alpha, \mu)$ be the relative entropy, $H(\alpha, \mu) = \sum_{a \in S}\alpha_a\log(\alpha_a/\mu_a)$,
so that $H(\alpha, \mu) \ge 0$, with equality iff $\mu$ is uniform. Formula (1.1) in the
introduction to this paper gives necessary and sufficient conditions for
$M(m, n)/\log_{1/p}(mn) \to_P 1$ when $m$ grows like $n^\theta$; the critical value there is
given by $H(\alpha, \mu)/(\log(1/p) - H(\alpha, \mu))$.
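To make the quantities in this paragraph concrete, here is a small sketch that computes the tilted distribution $\alpha$, the relative entropy $H(\alpha, \mu)$, and the ratio $H(\alpha, \mu)/(\log(1/p) - H(\alpha, \mu))$. The function name `pattern_critical_value` is hypothetical, and natural logarithms are assumed throughout (the ratio does not depend on the base).

```python
import math


def pattern_critical_value(mu):
    """Return (alpha, H, ratio) for a letter distribution mu.

    alpha_a = mu_a^2 / p, H = sum_a alpha_a * log(alpha_a / mu_a), and
    ratio = H / (log(1/p) - H), where p is the sum of squares of mu.
    """
    if any(x < 0 for x in mu) or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("mu must be a probability vector")
    p = sum(x * x for x in mu)
    if p >= 1.0:
        raise ValueError("mu must not be a point mass")
    alpha = [x * x / p for x in mu]
    # letters with mu_a = 0 have alpha_a = 0 and contribute nothing
    h = sum(a * math.log(a / x) for a, x in zip(alpha, mu) if a > 0)
    return alpha, h, h / (math.log(1.0 / p) - h)


if __name__ == "__main__":
    print(pattern_critical_value([0.25, 0.25, 0.25, 0.25]))
    print(pattern_critical_value([0.7, 0.1, 0.1, 0.1]))
```

The denominator equals $-\sum_a \alpha_a \log \mu_a$, which is strictly positive whenever $\mu$ is not a point mass, so the ratio is well defined.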


6. Boundary effects are negligible. To be strictly accurate, we note that
$M_k(m, n)$, defined in (1.2), is not really the length of the longest $k$-interrupted
matching between $X_1 \cdots X_m$ and $Y_1 \cdots Y_n$. Instead, $M_k(m, n)$ is the maximum
of an $m$ by $n$ block of random variables from a two-parameter stationary family,
and the event $\{M_k(m, n) \ge u\}$ involves $m + n + 2u$ letters, $X_0$ through $X_{m+u-1}$
and $Y_0$ through $Y_{n+u-1}$. Given actual data, $X_1 \cdots X_m$ and $Y_1 \cdots Y_n$, the length
Mm, n) of the longest k-interrupted matching is defined below; note that it is 




a nontrivial function of all m + n letters. Here for comparison are the two 
definitions: 


Mf = M{(m, n) = max{u: X,,, --- X,,, matches Y}, e Ya, 
with at most k mismatches, for some 


O<ism-—u,0<sj<n- 4}, 
M, = M,(m, n) = max{u: X,,, ++: X,,, matches Y; = Y, 


Jt+u 


with at most k mismatches, and X, # Y, 
for some 0 <i < m, 0 <j <n}. 
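For actual data, the first of the two definitions above (the longest stretch of aligned letters with at most $k$ mismatches, taken over all offsets inside the two sequences) can be computed directly. The sketch below is one straightforward way to do it, not the authors' implementation: it slides a window along each diagonal of the $m \times n$ comparison grid and keeps at most $k$ mismatches inside the window. The function name `longest_k_interrupted_match` is hypothetical.

```python
def longest_k_interrupted_match(x, y, k=0):
    """Length of the longest aligned stretch of x and y with <= k mismatches.

    Each diagonal (fixed offset between positions in x and y) is scanned
    with a two-pointer window that never holds more than k mismatches.
    Running time is O(len(x) * len(y)).
    """
    if k < 0:
        raise ValueError("k must be nonnegative")
    m, n = len(x), len(y)
    best = 0
    for d in range(-(m - 1), n):      # d = (index in y) - (index in x)
        i0 = max(0, -d)               # first position on this diagonal
        j0 = i0 + d
        length = min(m - i0, n - j0)
        left = 0
        mismatches = 0
        for right in range(length):
            if x[i0 + right] != y[j0 + right]:
                mismatches += 1
            while mismatches > k:     # shrink window until it is legal
                if x[i0 + left] != y[j0 + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best


if __name__ == "__main__":
    print(longest_k_interrupted_match("ABCDE", "XBCDY", k=0))
    print(longest_k_interrupted_match("ABCDE", "XBCDY", k=1))
```

The second definition differs only in requiring a mismatch immediately before the matched stretch and in allowing the stretch to run past positions $m$ and $n$, which is what makes it the maximum of a stationary block.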


In Lemma 6 we show that M#(m, n) is on the order of log(mn), provided that 
each sequence is at least this long. Theorem 3 essentially says that boundary 
effects are negligible if and only if each sequence is many times longer than 
Mé(m, n). 


LEMMA 6. Suppose the letters $\{X_i\}$ and $\{Y_j\}$ are i.i.d. ($\mu$) from a finite
alphabet $S$. Let $H = -\sum_{a \in S}\mu_a\log(\mu_a) > 0$ be the entropy of $\mu$, and let $p = \sum_{a \in S}(\mu_a)^2$. Fix $k \ge 0$. For any $\varepsilon > 0$, as $m$ and $n \to \infty$,


P(min(m, n, (1 — e)log(max(m, n))/H) < Mį(m, n) < (1 + e)log, „(mn )) 


>l. 


Proor. The upper bound comes from Chebyshev’s inequality, as discussed 
following formula (1.5), and using (4.2) in case k > 0. For the lower bound, it 
suffices to consider the case k = 0, since M4m, n) = Mdm, n) < M£(m, n) for 
all m, n, and k. Without loss of generality we may assume that m < n. 

Let t = min(m, |(1 — e)log(n)/H|). Write W for the random word 
W=X, --: X, €S’ We will show that with probability tending to 1, 
the word W appears somewhere in Y, --- Y,, so that M%(m,n) > t. Since 
(1/t)log(u(W)) > —H in probability, P(u'(W) > exp(-(1 + e)Ht)) > 1. For 
large n, exp(—(1 + e)Ht) > exp(—(1 — e*)logn) > t?/n. Now in n/t indepen- 
dent trials, each with success probability > t?/n, the probability of no successes 
is bounded above by (1 — £?/n)"/* < e~t > 0. Hence 

Piast --- Y= W forsome 0<j<n/tly'(W)> t?/n)} se 


J 


and the lower bound is proved. 0 


THEOREM 3. Suppose the letters $\{X_i\}$ and $\{Y_j\}$ are i.i.d. ($\mu$) from a finite
alphabet $S$, with $0 < \mu_a < 1$ for all $a \in S$. Fix $k \ge 0$, and let $m$ and $n \to \infty$
along a given sequence $(m_i, n_i)$. The following conditions are equivalent:


(a) $(\log n)/m \to 0$ and $(\log m)/n \to 0$,
(b) P(Mf(m, n) z M,(n, n)) = 1. 


Proor. Fix any e > 0, and let t = t(m, n) = |a + e)log, /,(mn)|, so that 
by Lemma 6, P(M? > t) > 0. We place the letters X, X, --- X,.4;-; around a 
circle of length m + t, and similarly place the letters Y,Y, --- Y,,,-, around a 


n 




circle of length n + t. Let L,, be the length of the k-interrupted match following 
position (z, 7) on this pair of circles: L,, = max{u: X,,, tt X,,, matches 
Yace Yau with at most mismatches}, where the indices for X are taken 
modulo (m + t) and those for Y are taken modulo (n + t). 

We show that (a) implies (b). Let A = Z? N [0, m + t) x [0, n + t), and let 
B=Z?(t,m—t)X(t,n—t). Let E = E(m,n) be the event {max,, yen 
L,, < MAXa, »,¢a—pl,,}- Since the two-parameter family (L,,) is stationary, 
P(E) <|A — BAI, and thus condition (a) implies that P(E) ~ 0. Note that 
{M? + M,} c E U {M£ 2 t}, so that P(M? + M,) > 0, which is condition (b). 

Now we assume that (a) fails, and show that (b) fails. Without loss of 
generality we may assume that m < n and (log n)/m is bounded away from 0. 
We need to show that P(M# + M,) is bounded away from zero, Let L, 
(L,,)1(X, # Y,), still taking the index ¿ modulo (m + ż¿) and the index j e E 
(n + t). Let s = min(m — 1,(1 — e)log(n)/H). Let D be the event {s < ME < t}. 
By Lemma 6, P(D)>1. Let T= Z?O[1,m) X[1,n), and let J=Z?n 
[m — s, m) X [l, n). Let F be the event {max,_,L,,< max,L,,}. Since 
the two-parameter family (L; j) is stationary, P(F) > FT| = s/m. By the 
assumptions that (a) fails and that m < n, &/m is bounded away from zero, and 
hence P( F’) is bounded away from zero. 

Let G be the event {max,, pe- sly < max,, e zL}. Note that G A Dc 
(Mf? + M;}. Since P(D) > 1, we need “only show that P(G) is bounded away 
from zero in order to conclude that (b) fails. Let c = min, -g,/(1 — Ha), 80 that 
c € (0,1]. Consider the map from F A GN D into G defined informally as 
OUENS: Let u = max; yl = max ,L,,, and find the match X,,, 0 X,4,= 
Yer ++ Y4, with (i, 7) E J which minimizes i and j. Then change the letter 
ie so that it agrees with Y,,,,,,. This map f is many-to-one, but P(B) = 
cP{ f-'(B)) for any B c range( f ). Hence P(G) = e(P(F) ~ P(G) — P(D*)), so 
lim inf( P(G)/P(F)) = c/(1 + c). This shows that P(G) is bounded away from 
zero. O 


7. A biological example. The complete DNA sequence of the virus lambda
can be represented as a word consisting of 48,502 letters from the alphabet
{A, C, G, T}, with respective relative frequencies {0.2543, 0.2343, 0.2643, 0.2471}.
See Sanger et al. (1982). The virus is a bacteriophage that infects the bacterium
Escherichia coli. As of mid-1984, the lambda sequence is the longest completely
known genetic sequence.

Because viral sequences do not seem to contain repeated or redundant regions,
we do not expect disjoint segments to contain long matching regions. As a test of
the utility of the asymptotic theory, we started at the beginning of the lambda
sequence and took disjoint segments of length $2^7, 2^8, \ldots, 2^{12}$, repeating the segmentation for 6 complete cycles. In all, 48,378 of the 48,502 letters were grouped
into 36 disjoint segments.

Each segment of length $2^m$ was matched against every other segment of length
$2^m$, and the longest $k$-interrupted match run was determined for each value of
$k = 0, 1, 2, \ldots, 6$. This matching scheme accounted for 630 determinations of
longest match runs $M_k(2^m, 2^m)$: $\binom{6}{2}$ pairings with 6 levels of $m$ by 7 levels of $k$.
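The bookkeeping of this design is easy to check. The sketch below assumes six segment lengths, $2^7$ through $2^{12}$, and recomputes the match probability $p$ from the cited letter frequencies; the variable names are illustrative only.

```python
from math import comb

# relative letter frequencies cited for the lambda sequence
frequencies = {"A": 0.2543, "C": 0.2343, "G": 0.2643, "T": 0.2471}

# probability that two independent letters agree
p = sum(f * f for f in frequencies.values())

exponents = list(range(7, 13))   # assumed levels of m: 7, 8, ..., 12
cycles = 6                       # segments available at each length
k_values = list(range(0, 7))     # k = 0, 1, ..., 6

segments = cycles * len(exponents)
determinations = comb(cycles, 2) * len(exponents) * len(k_values)

if __name__ == "__main__":
    print(round(p, 4), segments, determinations)
```

Each length contributes $\binom{6}{2} = 15$ unordered pairs of segments, and every pair is scored once for each of the 7 values of $k$.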







[Figure 1: scatterplot of observed longest match run lengths (vertical axis) against the asymptotic theoretical center (horizontal axis).]

FIG. 1.


For the value of $p = 0.2505 = 0.2543^2 + 0.2343^2 + 0.2643^2 + 0.2471^2$, we computed the exact solution $u_k(m)$ to the equation

(7.1)    $2^{-2m} = (1 - p)G_k(u)$,

where $G_k(u)$ is defined in (4.1). If the order of letters in the lambda sequence
were well approximated by an i.i.d. scheme with the cited frequencies, we would
expect as a consequence of Theorems 1 and 2 to observe longest match runs
consisting of about $u_k(m) + \gamma/l - 1/2$ letters.

Presented in Figure 1 is a scatterplot of the 630 actual values of longest
$k$-interrupted match runs plotted against their asymptotic theoretical values
$u_k(m) + \gamma/l - 1/2$. The digits 0, 1, ..., 6 signify the value of $k$ for the particular
$k$-interrupted run. Because the longest match run must take on integer values,
the points themselves are randomly jittered by a small uniformly distributed
value in order to display the density of deviations. The data appear well centered
about the theoretical value.

The reader will note that we have used $u_k$, and not the asymptotically
equivalent value $v_k$ defined in (4.6) and used in Theorem 2. That such accommodation to small sample properties is necessary is evident from Table 1 (and our
initial less satisfactory attempts at plotting the longest match run against $v_k$).

TABLE 1
Comparison of two asymptotically equivalent forms for centering constant

m                   7      7      7     10     10     10     12     12     12
k                   0      3      6      0      3      6      0      3      6
v_k(q2^{2m})      6.8   12.0   15.1    9.8   15.8   19.7   11.8   18.2   22.5
u_k(q2^{2m})      6.8   13.4   19.0    9.8   16.9   22.9   11.8   19.2   25.4
v_k - u_k         0.0   -1.4   -3.9    0.0   -1.1   -3.2    0.0   -1.0   -2.9

Given the remarkable fit of predicted location, the next natural question is the 
quality of approximation for the predicted dispersion. Here the answer is less 
satisfactory. Spread is greater than predicted by the asymptotic theory, espe- 
cially for larger values of k. We believe that a large portion of the lack of fit is 
attributable to slowness of convergence. This suspicion is supported by the data 
from the bacteriophage lambda, and from a simulation in which the segments’ 
letters were truly generated by an i.i.d. mechanism, with uniform distribution 
upon an alphabet of four letters. 

As a consequence of Theorem 2(c), $P\{M_k(2^m, 2^m) - \lfloor u_k(q2^{2m}) \rfloor = j\}$ should be
well approximated by $\exp(-\eta p^{j+1}) - \exp(-\eta p^j)$, where $\log_{1/p}\eta = \beta(u_k(q2^{2m}))$,
the fractional part of $u_k(q2^{2m})$. Presented in Table 2 is a comparison of
deviations from $\lfloor u_k(q2^{2m}) \rfloor$. The columns for expected values are computed as
discussed above for the biological sequence with $p = 0.2505$, and for the simulated biological sequence with $p = 0.2500$. The "observed" columns correspond to
the biological data plotted in Figure 1, and for simulated data from a uniform
four-letter alphabet with segments of identical length.

TABLE 2
Observed and expected deviations from asymptotic
centers for the longest match run

              lambda (p = 0.2505)        simulation (p = 0.2500)
   j         observed    expected        observed    expected
  -4              1         0.0               0         0.0
  -3              3         0.0               2         0.0
  -2             26         1.6              27         1.2
  -1            103        93.1             152        87.1
   0            192       277.9             217       277.3
   1            173       177.7             148       182.1
   2             79        58.6              60        60.5
   3             32        15.8              15        16.3
   4             15         4.0               5         4.1
   5              3         1.0               3         1.0
   6              2         0.3               1         0.3
   7              1         0.1               0         0.1
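The integerized extreme value probabilities used for the expected columns have a closed form, so a single term is simple to evaluate. The sketch below is illustrative: the function name `deviation_probability` is hypothetical, and `beta` stands for the fractional part of the centering constant, which in the paper varies with $m$ and $k$ (the published expected counts aggregate over all 630 determinations, which this sketch does not attempt to reproduce).

```python
import math


def deviation_probability(j, beta, p):
    """Approximate P{M - floor(u) = j} = exp(-eta p^(j+1)) - exp(-eta p^j).

    Here eta = (1/p)**beta, i.e. log_{1/p}(eta) = beta, with beta in [0, 1)
    the fractional part of the centering constant.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    eta = (1.0 / p) ** beta
    return math.exp(-eta * p ** (j + 1)) - math.exp(-eta * p ** j)


if __name__ == "__main__":
    for j in range(-3, 6):
        print(j, round(deviation_probability(j, 0.7925, 0.25), 4))
```

Because consecutive terms telescope, the probabilities over all integers $j$ sum to 1, which gives a quick sanity check on any tabulation.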

We tentatively conclude from this simulation and from other simulation work
that the degree of lack of fit from asymptotic prediction could be attributable to
small sample properties of the distribution of the maximum k-interrupted match
run length. A greater understanding of rates of convergence to the integerized
extreme value distribution is clearly needed.


Acknowledgment. The authors thank S. Karlin for calling to our attention 
the question of the locations of long matches, addressed here in Theorem 1’. 


REFERENCES 


ANDERSEN, J. S. et al. (1984). Nucleotide Sequences 1984. A Compilation from the GenBank and
EMBL Data Libraries 1, 2. IRL Press, Oxford.

ARRATIA, R. and WATERMAN, M. S. (1985a). An Erdős–Rényi law with shifts. Adv. in Math. 55
13-23.

ARRATIA, R. and WATERMAN, M. S. (1985b). Critical phenomena in sequence matching. Ann.
Probab. 13 1236-1249.

BONFERRONI, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubbl. Istit. Sup.
Sci. Econ. Commerciali Firenze 8 1-62.

BOYD, D. W. (1972). Losing runs in Bernoulli trials. Unpublished manuscript.

DARLING, R. and WATERMAN, M. S. (1985). Matching rectangles in d-dimensions: algorithms and
laws of large numbers. Adv. in Math. 65 1-12.

DOOLITTLE, R. F. et al. (1983). Simian sarcoma virus onc gene, v-sis, is derived from the gene (or
genes) encoding a platelet-derived growth factor. Science 221 275-276.

ERDŐS, P. and RÉVÉSZ, P. (1975). On the length of the longest head-run. Topics in Information
Theory. Colloquia Math. Soc. J. Bolyai 16 219-228. Keszthely, Hungary.

FERGUSON, T. S. (1984). On the distribution of Max and Mex. Preprint.

GORDON, L., SCHILLING, M. F. and WATERMAN, M. S. (1986). An extreme value theory for long head
runs. Probab. Theory Rel. Fields 72 279-288.

GUIBAS, L. J. and ODLYZKO, A. M. (1980). Long repetitive patterns in random sequences. Z.
Wahrsch. verw. Gebiete 53 241-262.

GUIBAS, L. J. and ODLYZKO, A. M. (1981). Periods in strings. J. Combin. Theory Ser. A 30 19-42.

HARDY, G. H., LITTLEWOOD, J. E. and PÓLYA, G. (1934). Inequalities. Cambridge University Press.

KARLIN, S., GHANDOUR, G., OST, F., TAVARÉ, S. and KORN, L. J. (1983). New approaches for
computer analysis of nucleic acid sequences. Proc. Nat. Acad. Sci. U.S.A. 80 5660-5664.

LEADBETTER, M. R., LINDGREN, G. and ROOTZÉN, H. (1983). Extremes and Related Properties of
Random Sequences and Processes. Springer, New York.

SANGER, F., COULSON, A. R., HONG, G. F., HILL, D. F. and PETERSON, G. B. (1982). Nucleotide
sequence of bacteriophage λ DNA. J. Molecular Biol. 162 729-773.

SMITH, T. F., WATERMAN, M. S. and BURKS, C. (1985). The statistical distribution of nucleic acid
similarities. Nucleic Acids Research 13 645-656.

WATERMAN, M. S. (1984). General methods of sequence comparison. Bull. Math. Biol. 46 473-500.

WATSON, G. S. (1954). Extreme values in samples from m-dependent stationary stochastic processes.
Ann. Math. Statist. 25 798-800.


DEPARTMENT OF MATHEMATICS 
UNIVERSITY OF SOUTHERN CALIFORNIA 
Los ANGELES, CALIFORNIA 90089 


The Annals of Statistics
1986, Vol. 14, No. 3, 994-1011 


SKEWNESS AND ASYMMETRY: MEASURES AND
ORDERINGS¹


By H. L. MACGILLIVRAY 
University of Queensland 


Recent interest in skewness has tended to separate two aspects of the 
concept. Two distributions may be compared with respect to skewness, or a 
distribution may be self-compared, that is, the distributions of the random 
variables of X and — X may be compared. This paper uses the unification of 
these two aspects to attempt to complete a skewness structure of orderings 
that identifies the roles of various skewness and scale measures and enables 
classification of the skewness properties of any distribution. The structure is
also used to propose measures of asymmetry. Some skewness properties of the 
Weibull and Johnson systems are examined. 


1. Introduction. Sections 1.1, 1.2, and 1.3 give a brief historical survey of 
work on skewness, including measures, partial orderings, and relationships be- 
tween location parameters. With reference to this, Section 1.4 outlines the aims 
and content of this paper. 


1.1. Classical measures. As for location, scale, and kurtosis, the concept of
skewness was introduced with an apparently appropriate measure. Pearson (1895)
proposed $(\mu - M)/\sigma$ as a measure of skewness for a univariate distribution with
mean $\mu$, mode $M$, and variance $\sigma^2$. Three other measures of skewness appear to
have been introduced soon afterwards [Bowley (1901), pages 116 and 251, and
Yule (1911), page 162]. These are $(\mu - m)/\sigma$, where $m$ is the median, $\mu_3/\sigma^3$,
where $\mu_3$ is the third central moment, and $(q_u + q_l - 2m)/(q_u - q_l)$, where
$q_u$, $q_l$ are the upper and lower quartiles, respectively. All are based on the criteria
that a skewness measure should be scale-free and zero for symmetric distributions.
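Sample versions of three of these measures can be computed with the Python standard library. The sketch below is illustrative only: the function name `classical_skewness` is hypothetical, the mode-based Pearson measure is left out because a sample mode is not well defined for continuous data, and the quartiles use the "inclusive" interpolation rule of `statistics.quantiles`, which is one convention among several.

```python
import statistics


def classical_skewness(sample):
    """Sample analogues of (mu - m)/sigma, mu_3/sigma^3 and the quartile measure.

    Uses the population standard deviation and the 'inclusive' quartile rule;
    both are conventions chosen for this sketch.
    """
    xs = sorted(sample)
    if len(xs) < 3:
        raise ValueError("need at least three observations")
    mean = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    median = statistics.median(xs)
    q_l, _, q_u = statistics.quantiles(xs, n=4, method="inclusive")
    if sigma == 0 or q_u == q_l:
        raise ValueError("sample has no spread")
    mu3 = sum((x - mean) ** 3 for x in xs) / len(xs)
    return {
        "median": (mean - median) / sigma,
        "moment": mu3 / sigma ** 3,
        "quartile": (q_u + q_l - 2 * median) / (q_u - q_l),
    }


if __name__ == "__main__":
    print(classical_skewness([1, 2, 3, 4, 5]))           # symmetric sample
    print(classical_skewness([0, 1, 2, 4, 8, 16, 32]))   # long right tail
```

All three values are unchanged if the sample is shifted or rescaled by a positive constant, and all three vanish for a sample that is symmetric about its median.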

The initial roles of $(\mu - m)/\sigma$ and $\mu_3/\sigma^3$ seem to lie in their relationships
[empirical in the case of $(\mu - m)/\sigma$] to the Pearson skewness for distributions of
the Pearson family. Gradually $\mu_3/\sigma^3$ assumed more prominence as "the skewness," as is illustrated by the papers of Doodson (1917) and Haldane (1942), who,
in examining Pearson's empirical relationship between $M - m$ and $M - \mu$, used
$(\mu - M)/\sigma$ and $\mu_3/\sigma^3$, respectively, to measure skewness. For $(\mu - m)/\sigma$,
Hotelling and Solomons (1932) and Garver (1932) showed $|(\mu - m)/\sigma| \le 1$ with
this result being refined by Majindar (1962).

The quartile measure of skewness was introduced through its sample version 
as a descriptive statistic and the bound of 1 on its absolute value was regarded as 


Received January 1985; revised October 1985. 

¹Part of this research was done while the author was on leave at Heriot-Watt University,
Edinburgh. 

AMS 1980 subject classifications. Primary 62E10; secondary 60E05. 

Key words and phrases, Asymmetry, skewness, scale, partial orderings of distributions, mean, 
median, moments, quantiles. 




SKEWNESS AND ASYMMETRY 995 


an advantage. It was used to test symmetry by David and Johnson (1954), who
also suggested (1956) its generalisation to a quantile measure which is defined for
a continuous random variable with distribution function $F$ by

(1.1)    $\gamma_u(F) = [F^{-1}(1 - u) + F^{-1}(u) - 2m_F] / [F^{-1}(1 - u) - F^{-1}(u)]$,    $u \in (0, \tfrac{1}{2})$.

These measures have been used by Hinkley (1975), Hogg (1974) and Hogg et al.
(1975).
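When the quantile function $F^{-1}$ is available in closed form, the measure in (1.1) can be evaluated directly. The following sketch uses the standard exponential and standard logistic distributions purely as worked examples; the function names are hypothetical.

```python
import math


def quantile_skewness(quantile, u):
    """Quantile measure of (1.1) for a distribution with quantile function F^{-1}.

    quantile: callable t -> F^{-1}(t) for t in (0, 1); u must lie in (0, 1/2).
    """
    if not 0.0 < u < 0.5:
        raise ValueError("u must lie in (0, 1/2)")
    lower = quantile(u)
    upper = quantile(1.0 - u)
    median = quantile(0.5)
    return (upper + lower - 2.0 * median) / (upper - lower)


def exponential_quantile(t):
    # standard exponential: F(x) = 1 - exp(-x)
    return -math.log(1.0 - t)


def logistic_quantile(t):
    # standard logistic: symmetric about 0
    return math.log(t / (1.0 - t))


if __name__ == "__main__":
    print(quantile_skewness(exponential_quantile, 0.25))  # quartile measure
    print(quantile_skewness(logistic_quantile, 0.25))     # symmetric case
```

Taking $u = 1/4$ recovers the quartile measure; the value is positive for the right-skewed exponential distribution and zero for the symmetric logistic distribution.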

Yule (1911, page 162) noted that the measures $(q_u + q_l - 2m)/(q_u - q_l)$ and
$(\mu - M)/\sigma$ "are positive if the longer tail of the distribution lies toward the
higher values of the variate." Fechner (1897) and Timerding (1915) examined
asymmetric densities for which $\mu$, $m$, and $M$ occur in this or the reverse order,
but their papers seemed to be little known. For different reasons, a number of
authors [Groeneveld and Meeden (1977), Runnenburg (1978), and MacGillivray
(1979)] almost simultaneously considered conditions giving the sign of $\mu - m$ and
$m - M$, or $\mu_3$. van Zwet (1979) showed the link to his partial ordering (see
Section 1.2), and MacGillivray (1981) the link to the properties of the strictly 
totally positive kernel x” [Karlin (1968), page 21]. It is curious that although 
asymmetric distributions with $\mu_3 = 0$ were familiar [see, for example, Ord (1968)
and Johnson and Kotz (1970), page 253], conditions sufficient to determine the
sign of $\mu_3$ were not previously established.


1.2. Partial orderings of distributions. As noted by van Zwet (1964) and Oja 
(1981), the measures of Section 1.1 were equated with a concept without answers 
to the questions: when are the measures appropriate representatives of the 
concept, and, if not, what can be used? Each measure either assumes or imposes 
an ordering between distributions. van Zwet (1964) used convex transformations 
to examine and formalise the concepts of skewness and kurtosis, claiming that 
such concepts require meaningful orderings of distributions which measures of 
skewness or kurtosis must then preserve. For distribution functions F and G 
which have densities with an interval support, van Zwet defined $G$ as having
greater skewness to the right than $F$, if $G^{-1}(F(x))$ is convex on $I_F = \{x: 0 < F(x) < 1\}$, and showed that the standardised odd central moments
preserve this ordering. That is, if $G^{-1}(F(x))$ is convex on $I_F$, then

$$\mu_{F,2k+1}/\sigma_F^{2k+1} \le \mu_{G,2k+1}/\sigma_G^{2k+1}, \qquad k = 1, 2, \ldots.$$

This ordering with respect to the exponential distribution gives increasing failure 
rate (IFR) and decreasing failure rate (DFR) distributions and van Zwet’s 
orderings, plus others based on star-shaped transformations, have been used in 
reliability theory [for example, Barlow and Proschan (1966)], power of rank tests 
[Doksum (1969)], and inequalities on order statistics [Lawrence (1975)]. 

Mann and Whitney (1947) had previously introduced the notion of “stochastic 
ordering,” a partial ordering of distributions with respect to location. Bickel and 
Lehmann (1975) used this in their work on measures of location for asymmetric 
distributions, and also introduced (1976) a partial ordering for distributions, 


996 H. L. MacGILLIVRAY 


symmetric or asymmetric, with respect to spread or dispersion. Oja (1981) unified 
the work on partial orderings of distributions according to the different attributes 
of location, scale, skewness, showing that the definitions introduced by the 
authors above correspond to convexity of increasing order [Karlin (1968), page 
23] of the same function, namely 


$$\Delta_{F,G}(x) = G^{-1}(F(x)) - x \quad \text{on } I_F.$$


Oja introduced some weaker orderings for scale, skewness, and kurtosis, and 
discussed how to find measures that preserve the various orderings, concentrating 
on measures based on moments. 


1.3. Measures based on quantiles. The function A(x) had been previously 
considered in the special case of G(x) = 1 — F(—x) = F(x) by Doksum (1975) 
in considering real-valued functionals satisfying the usual location axioms, 
and their values, giving the location set. The symmetry function 6@,(x) = 
1[x — F~'(F(x))] has the location set as the closure of its range, and is the only 
function such that x — 26,(x) is nondecreasing a.e. (with respect to F), and 
X — 26,(X) has the distribution of — X. Doksum used the difference between 
p(x) and the median mp as a measure of asymmetry, and defined F as being 
strongly skewed to the right if and only if 6,(x) is nonincreasing for x < mp and 
nondecreasing for x 2 Mp, and skewed to the right if and only if 6,(x) = Mp, 
(9,(m,) = mp). These definitions are weaker than that due to van Zwet (1979) 
and used by Oja (1981), that F is skewed to the right if there exists a symmetric 
distribution F, such that F~'(F,(x)) is convex on Ip. ry As stated by van Zwet 
(1979), this is equivalent to F more skewed to the right than F, the distribution 
of — X. 

Doksum's (1975) weaker definition of skewness to the right is equivalent to van
Zwet's (1979) condition, $F^{-1}(t) + F^{-1}(1 - t) \ge 2m_F$, for the mean, median, and
mode to occur in reverse order, so that $(\mu_F - M_F)/\sigma_F$ and $(\mu_F - m_F)/\sigma_F$ are
nonnegative. Doksum's symmetry function and index of skewness are
related to the quantile measures of skewness, $\gamma_u(F)$ of (1.1). This is discussed in
Section 2.
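As an illustrative sketch only (it is not part of the original paper), Doksum's two conditions can be checked numerically for a single distribution. The Python below assumes the unit exponential distribution as the example and evaluates the symmetry function in the form $\theta_F(x) = \frac{1}{2}[x + F^{-1}(1 - F(x))]$, which is the expression above rewritten using $\bar F^{-1}(u) = -F^{-1}(1 - u)$. The function names and the grid are choices made here for illustration.

```python
import math


def exp_cdf(x):
    """Distribution function of the unit exponential (assumed example)."""
    return 1.0 - math.exp(-x)


def exp_quantile(u):
    """Quantile function F^{-1}(u) = -log(1 - u) of the unit exponential."""
    return -math.log1p(-u)


def symmetry_function(x):
    """theta_F(x) = (1/2) * [x + F^{-1}(1 - F(x))]."""
    return 0.5 * (x + exp_quantile(1.0 - exp_cdf(x)))


median = exp_quantile(0.5)                  # log 2
xs = [0.05 * k for k in range(1, 101)]      # grid on (0, 5]
thetas = [symmetry_function(x) for x in xs]

below = [t for x, t in zip(xs, thetas) if x < median]
above = [t for x, t in zip(xs, thetas) if x >= median]

# Strong skewness to the right: nonincreasing below the median,
# nondecreasing above it (checked on the grid only).
strongly_right_skew = (
    all(a >= b for a, b in zip(below, below[1:]))
    and all(a <= b for a, b in zip(above, above[1:]))
)

# Skewness to the right: theta_F(x) >= m_F at every grid point.
right_skew = all(t >= median - 1e-12 for t in thetas)
```

A grid check of this kind can only fail to contradict the conditions; it does not prove them. For the exponential they can be confirmed directly, since $\theta_F'(x)$ has the sign of $1 - 2e^{-x}$.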


1.4. Outline of paper. Thus against a background of a variety of motivations 
and applications, a variety of skewness measures and/or orderings have been 
used. Some links have been given but the structure is not complete. There 
appears to be reasonably general agreement on two points. First, comparing the 
skewness of two distributions refers, whether implicitly or explicitly, to a partial 
ordering of distributions with respect to skewness, and a measure of skewness 
should preserve an acceptable ordering. Second, discussions of the skewness of a 
distribution to the left or right, refer to skewness comparisons of the distributions 
of X and —X, described here as self-comparisons. Therefore, for consistency, 
orderings and measures established in one context should apply also to the other. 
In particular, meaningful interpretation of a measure’s numerical value and sign 
requires knowledge of the orderings it preserves. 


SKEWNESS AND ASYMMETRY 997 


Section 2 therefore aims to identify a skewness structure according to the 
following criteria: 


(a) The same structure should apply to comparisons of different distributions 
and to self-comparisons for a single distribution. 

(b) It should identify the roles of skewness measures and partial orderings
previously introduced in either of the contexts in (a), and provide a hierarchy
of such.

(c) No skewness measure should be used without identification of the ordering it 
preserves, and this ordering should aim to cover as large a class of distribu- 
tions as possible. 

(d) It should be possible to describe the skewness of any asymmetric distribution. 

(e) The structure should not be more complicated than is necessary to meet
(a)-(d).


In certain circumstances arbitrary distributions have been considered by some
authors, but there are complications in the general structure, and for simplicity
attention is confined here to the class, $\mathcal{F}$, of distribution functions F(x) which
have probability density functions f(x) with an interval support.

Although not necessarily complete the resultant hierarchy of orderings is 
reasonably extensive. Some orderings may be more useful in practice than others, 
but the hierarchy is an important background structure underlying any descrip- 
tion of skewness properties; like convergence, skewness has a range of “strengths.” 

On the basis of the skewness structure, and ideas in Doksum (1975), Section 
2.4 discusses measures of asymmetry. 

Section 3 examines the skewness properties of some distribution families, not 
only to illustrate various aspects of the skewness structure but also to increase 
the knowledge of the properties of these distributions. The Weibull family and 
Section 2.4 indicate that for a complete understanding of the skewness properties 
of some distribution families, it may be necessary to consider both self-compari- 
sons and comparisons between the distributions. 


2. Orderings and measures. van Zwet's (1964) partial ordering with respect
to skewness on $\mathcal{F}$ is the strongest that has been considered and it is
unlikely that anything stronger is useful. Using a slight modification of Oja's
(1981) notation, $F \le_2 G$ iff $G^{-1}(F(x))$ is convex on $I_F$, and F is said to be not
more strongly skew to the right than G. "Convex" here includes the possibility of
linearity. It may sometimes be convenient to define $F <_2 G$ iff $G^{-1}(F(x))$ is
strictly convex, that is convex but not linear, on $I_F$, in which case G is more
strongly skew to the right than F.

An equivalent definition of $F \le_2 G$ is that F(x) and G(ax + b) cross each
other at most twice for any a, b, or, using Karlin's (1968, pages 20 and 280-282) $S^-$
notation, $S^-(F(x) - G(ax + b)) \le 2$ for all a, b, with the sequence of signs
being positive to negative to positive when equality holds. Thus van Zwet's
skewness ordering has no reference to any measures of location and scale, and any
weakening of the ordering in the sense of covering larger classes of distributions,
involves reference to particular location and scale parameters.
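The convexity definition can be illustrated with a small numerical sketch. The pair used below (F uniform on (0, 1), G unit exponential) is an assumption made for illustration, as are the helper names; the check uses second differences on an evenly spaced grid, so it is a necessary-condition test rather than a proof.

```python
import math


def is_convex_on_grid(func, grid, tol=1e-12):
    """True if all second differences of func on an evenly spaced grid are >= -tol."""
    values = [func(x) for x in grid]
    return all(
        values[i - 1] - 2.0 * values[i] + values[i + 1] >= -tol
        for i in range(1, len(values) - 1)
    )


def uniform_to_exponential(x):
    """G^{-1}(F(x)) with F uniform on (0, 1) and G unit exponential."""
    return -math.log1p(-x)


def exponential_to_uniform(y):
    """F^{-1}(G(y)) for the same pair, i.e. 1 - exp(-y)."""
    return -math.expm1(-y)


grid_uniform = [k / 200 for k in range(1, 200)]      # interior of (0, 1)
grid_exponential = [k / 40 for k in range(1, 200)]   # part of (0, infinity)

# G^{-1}(F(x)) = -log(1 - x) is convex, so the uniform is not more
# strongly skew to the right than the exponential.
uniform_below_exponential = is_convex_on_grid(uniform_to_exponential, grid_uniform)

# The reverse transformation is concave, so the reverse ordering fails.
exponential_below_uniform = is_convex_on_grid(exponential_to_uniform, grid_exponential)
```

Because the ordering is defined through $G^{-1}(F(x))$ alone, no location or scale standardisation is needed before running the check, which mirrors the remark above.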




2.1. Orderings with respect to the mean. Oja's (1981) ordering $\le_2^*$ is a
weakening of $\le_2$ that is still preserved by the standardized odd moments
$\mu_{2k+1}/\sigma^{2k+1}$. The formal definition is $F \le_2^* G$ iff there exist $x_1 < \mu_F < x_2$ such
that

(2.1)    $G^{-1}(F(x)) \;\{\le,\ \ge\}\; \frac{\sigma_G}{\sigma_F}\,x + \Bigl(\mu_G - \frac{\sigma_G}{\sigma_F}\,\mu_F\Bigr)$

resp. for $x_1 \le x \le x_2$, and for $x \le x_1$ or $x \ge x_2$.

Equivalently the standardised distributions $F(\mu_F + \sigma_F x)$ and $G(\mu_G + \sigma_G x)$ cross
each other exactly once on each side of x = 0, with $F(\mu_F) \le G(\mu_G)$. However,
the restriction on the positions of the crossings, although implied if $F \le_2 G$, is
not required for an ordering preserved by the standardised odd moments.

From MacGillivray (1985), $M_{F,G} = F(\mu_F + \sigma_F x) - G(\mu_G + \sigma_G x)$ is either
identically zero or changes sign at least twice; if exactly twice, say from $\ge 0$ to
$\le 0$ to $\ge 0$, then $\mu_{F,2k+1}/\sigma_F^{2k+1} < \mu_{G,2k+1}/\sigma_G^{2k+1}$, $k = 1, \ldots$. Hence skewness
with respect to the mean may be defined as follows:


DEFINITION 2.1. G is more skew to the right with respect to the mean than
F, $F <_2^\mu G$, iff

(2.2)    $S^-(M_{F,G}) = 2$ from $\ge 0$ to $\le 0$ to $\ge 0$.

More generally, $F \le_2^\mu G$ iff (2.2) holds or $M_{F,G} \equiv 0$.


THEOREM 2.1.
    $F \le_2^\mu G \Rightarrow \mu_{F,2k+1}/\sigma_F^{2k+1} \le \mu_{G,2k+1}/\sigma_G^{2k+1}$, $k = 1, \ldots$;
and if equality holds for any k, then $M_{F,G} \equiv 0$.

PROOF. This result is a special case of Theorem 1 of MacGillivray (1985). □


In the case of self-comparisons, $F \le_2^* \bar F \Leftrightarrow F \le_2^\mu \bar F \Leftrightarrow
S^-(1 - F(\mu_F + x) - F(\mu_F - x)) \le 1$, from $\ge 0$ to $\le 0$, which is the established
sufficient condition for $\mu_{F,2k+1} \le 0$.


2.2. Orderings and measures with respect to the median. van Zwet (1964,
page 16) gives an example using a discrete distribution that shows that a
nondecreasing convex transformation of a random variable does not necessarily
increase $(\mu - m)/\sigma$. An example using continuous distributions is provided by
the transformation $Y = \exp X$ of an exponential random variable with mean $1/\lambda$.
As $\lambda \to 2$, $(\mu_Y - m_Y)/\sigma_Y \to 0$, and hence there are values of $\lambda$ for which
$(\mu_Y - m_Y)/\sigma_Y < (\mu_X - m_X)/\sigma_X = 1 - \log 2$. So $(\mu - m)/\sigma$ does not preserve
$\le_2$ (nor therefore $\le_2^\mu$), but because there exists a condition on a distribution
sufficient for $\mu \le m$, the scale measure appears to be at fault.
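The continuous counterexample can be worked through in closed form; the sketch below is an illustration added here, with function names chosen for this purpose. It uses the standard fact that if X is exponential with mean $1/\lambda$ then $Y = \exp X$ satisfies $P(Y > y) = y^{-\lambda}$ for $y \ge 1$, so that for $\lambda > 2$ the mean is $\lambda/(\lambda - 1)$, the median is $2^{1/\lambda}$ and the variance is $\lambda/[(\lambda - 1)^2(\lambda - 2)]$.

```python
import math


def exponential_ratio():
    """(mu - m)/sigma for an exponential variable; it does not depend on the rate."""
    return 1.0 - math.log(2.0)


def exp_of_exponential_ratio(lam):
    """(mu - m)/sigma for Y = exp(X), X exponential with mean 1/lam, lam > 2."""
    if lam <= 2.0:
        raise ValueError("the variance of Y is finite only for lam > 2")
    mean = lam / (lam - 1.0)
    median = 2.0 ** (1.0 / lam)
    variance = lam / ((lam - 1.0) ** 2 * (lam - 2.0))
    return (mean - median) / math.sqrt(variance)


ratio_x = exponential_ratio()                 # about 0.307
ratio_y_near_2 = exp_of_exponential_ratio(2.1)   # variance of Y is very large here
ratio_y_large = exp_of_exponential_ratio(10.0)
```

With $\lambda = 2.1$ the ratio for Y falls below $1 - \log 2$ even though $\exp$ is increasing and convex, while with $\lambda = 10$ it exceeds it; the drop near $\lambda = 2$ comes entirely from the standard deviation growing without bound.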




LEMMA 2.1. If

(2.3)    $(G^{-1}(u) - m_G)/\eta_G \ge (F^{-1}(u) - m_F)/\eta_F$,  $u \in (0, 1)$,

where $\eta_F, \eta_G > 0$, then $(\mu_G - m_G)/\eta_G \ge (\mu_F - m_F)/\eta_F$.

PROOF. Follows from $\mu_F - m_F = \int_0^1 (F^{-1}(u) - m_F)\,du$. □


LEMMA 2.2.
    $F \le_2 G \Rightarrow$ (2.3) iff $\eta_G/\eta_F = f(m_F)/g(m_G)$,
where f, g are the probability density functions of F, G.


PROOF. $G^{-1}(F(x))$ convex on $I_F$ is equivalent to

(2.4)    $[G^{-1}(F(x)) - G^{-1}(F(y))]/(x - y)$ is nondecreasing for $x \in I_F$, for any y.

Hence, taking $y = m_F$, $F \le_2 G \Rightarrow$ (2.3) iff

    $\eta_G/\eta_F = \lim_{u \to 1/2} (G^{-1}(u) - m_G)/(F^{-1}(u) - m_F) = f(m_F)/g(m_G)$,

since f, g are nonzero in $I_F$, $I_G$ for $F, G \in \mathcal{F}$. □


The above lemmas provide an ordering that weakens $\le_2$ and is preserved by
$(\mu - m)$/(a measure of scale), which therefore also preserves $\le_2$. The weakening
procedure is different to that of Section 2.1, and starts from a different
characterisation of $\le_2$, given by (2.4). There are intermediate steps between (2.4)
and (2.3) and each has been used in a skewness context.

Oja's (1981) star-ordering, $\le_{\mathrm{star}}$ [which generalises the star-shaped orderings
of, for example, Barlow and Proschan (1966), Doksum (1969), and Lawrence
(1975)] is (2.4) for some value of y. However, the value of y is important, since
(2.4) for a particular y is a weakening of $\le_2$ to considering skewness with
respect to y. This point then becomes the particular location parameter around
which the skewness is taken. Taking $y = m_F$, the following definitions of
skewness progressively weaken $\le_2$ with respect to the median.


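DEFINITION 2.2. $F \le_{\mathrm{star}}^m G$ iff

(2.5)    $[G^{-1}(F(x)) - m_G]/(x - m_F)$ is nondecreasing in $I_F$,

or, equivalently,

(2.6)    $(G^{-1}(u) - m_G)/(F^{-1}(u) - m_F)$ is nondecreasing in (0, 1).


DEFINITION 2.3. $F \le_D^m G$ iff

(2.7)    $G^{-1}(F(x)) - [f(m_F)/g(m_G)]\,x$ is nonincreasing for $x \le m_F$ and nondecreasing for $x \ge m_F$ in $I_F$,

or, equivalently,

(2.8)    $G^{-1}(u) - [f(m_F)/g(m_G)]\,F^{-1}(u)$ is nonincreasing for $u \le \frac{1}{2}$ and nondecreasing for $u \ge \frac{1}{2}$ in (0, 1).


DEFINITION 2.4. $F \le_2^m G$ iff

(2.9)    $[G^{-1}(F(x)) - m_G]\,g(m_G) \ge (x - m_F)\,f(m_F)$ for $x \in I_F$,

or, equivalently, for $u \in (0, 1)$,

(2.10)   $[G^{-1}(u) - m_G]\,g(m_G) \ge [F^{-1}(u) - m_F]\,f(m_F)$.


DEFINITION 2.5. $F \le_\gamma^m G$ iff

(2.11)   $[G^{-1}(1 - F(x)) - m_G]/[F^{-1}(1 - F(x)) - m_F] \ge [G^{-1}(F(x)) - m_G]/[x - m_F]$ for $x < m_F$ in $I_F$

or, equivalently, for $u \in (0, \frac{1}{2})$,

(2.12)   $[G^{-1}(1 - u) - m_G]/[F^{-1}(1 - u) - m_F] \ge [G^{-1}(u) - m_G]/[F^{-1}(u) - m_F]$.


THEOREM 2.2.
    $F \le_2 G \Rightarrow F \le_{\mathrm{star}}^m G \Rightarrow F \le_D^m G \Rightarrow F \le_2^m G \Rightarrow F \le_\gamma^m G$.


PROOF. (2.5) is (2.4) for $y = m_F$.

Differentiating (2.5) gives
    $\{ f(x)/g[G^{-1}(F(x))] - [G^{-1}(F(x)) - m_G]/(x - m_F) \}(x - m_F) \ge 0$ in $I_F$,
which gives $f(x)/g[G^{-1}(F(x))] \;\{\ge,\ \le\}\; f(m_F)/g(m_G)$ for $x \;\{\ge,\ \le\}\; m_F$ $\Rightarrow$ (2.7).

Clearly (2.7) $\Rightarrow$ (2.9) $\Leftrightarrow$ (2.10) $\Rightarrow$ (2.12) $\Leftrightarrow$ (2.11). □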
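The median-based conditions are stated in terms of quantile functions, so they are straightforward to test on a grid. The sketch below is an added illustration under stated assumptions: F is taken to be uniform on (0, 1) and G unit exponential (a pair for which $G^{-1}(F(x))$ is convex), the helper names are hypothetical, and a small tolerance absorbs rounding. It checks (2.6), (2.10) and (2.12) in turn.

```python
import math


def uniform_quantile(u):
    return u


def exp_quantile(u):
    return -math.log1p(-u)


def centred_ratio(u, q_f, q_g):
    """(G^{-1}(u) - m_G) / (F^{-1}(u) - m_F), undefined at u = 1/2."""
    return (q_g(u) - q_g(0.5)) / (q_f(u) - q_f(0.5))


def satisfies_star(q_f, q_g, grid):
    """Condition (2.6): the centred ratio is nondecreasing in u."""
    ratios = [centred_ratio(u, q_f, q_g) for u in grid if u != 0.5]
    return all(a <= b + 1e-12 for a, b in zip(ratios, ratios[1:]))


def satisfies_median_order(q_f, q_g, f_at_median, g_at_median, grid):
    """Condition (2.10), using the densities at the medians as scale factors."""
    return all(
        (q_g(u) - q_g(0.5)) * g_at_median
        >= (q_f(u) - q_f(0.5)) * f_at_median - 1e-12
        for u in grid
    )


def satisfies_gamma_order(q_f, q_g, grid):
    """Condition (2.12): the centred ratio at 1 - u dominates that at u."""
    return all(
        centred_ratio(1.0 - u, q_f, q_g) >= centred_ratio(u, q_f, q_g) - 1e-12
        for u in grid
        if u < 0.5
    )


grid = [k / 100 for k in range(1, 100)]

# F uniform: f(m_F) = 1.  G unit exponential: g(m_G) = exp(-log 2) = 1/2.
star_ok = satisfies_star(uniform_quantile, exp_quantile, grid)
median_ok = satisfies_median_order(uniform_quantile, exp_quantile, 1.0, 0.5, grid)
gamma_ok = satisfies_gamma_order(uniform_quantile, exp_quantile, grid)

# Swapping the roles of F and G should break (2.10).
reverse_median_ok = satisfies_median_order(exp_quantile, uniform_quantile, 0.5, 1.0, grid)
```

All three conditions hold for this pair, as the chain of implications in Theorem 2.2 requires, and (2.10) fails once the two distributions are interchanged.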


Definition 2.3 gives the ordering generalised from Doksum's (1975) definition
of strong skewness to the right, namely,

(2.13)   $F^{-1}(\frac{1}{2} + u) + F^{-1}(\frac{1}{2} - u)$ is nondecreasing for $u \in [0, \frac{1}{2}]$.

This is equivalent to $\bar F \le_D^m F$. The strength of skewness given by (2.13) has also
been mentioned by van Zwet (1979) and used by David and Groeneveld (1982).

Definition 2.4 generalises the two conditions $1 - F(m_F + x) - F(m_F - x) \ge 0$
and $F^{-1}(1 - u) + F^{-1}(u) - 2m_F \ge 0$, which were given by van Zwet (1979) as
sufficient for $\mu_F \ge m_F$, and which were Doksum's (1975) definitions of F skew to
the right. Definition 2.4 plays the role of $\le_2^\mu$ but with respect to the median
instead of the mean.




It should be noted again that in self-comparisons, that is, comparing F and $\bar F$,
the measure of scale is not relevant and only the sign of a skewness measure is
relevant. If a skewness measure is quoted quantitatively it immediately implies
comparisons with other distributions and the measure of scale then plays an
essential role.

Groeneveld and Meeden (1984) show that $\gamma_u(\cdot)$ and $(\mu - m)/E|X - m|$ preserve
$\le_2$; Lemmas 2.1 and 2.2 introduce $(\mu_F - m_F) f(m_F)$. Theorem 2.3 below
identifies the roles of various skewness measures in the hierarchy of skewness
orderings given above. If a measure preserves an ordering it also preserves the
preceding orderings in the hierarchy of Theorem 2.2. Apart from preserving an
acceptable skewness ordering, Oja (1981) also suggested that skewness measures
should change only by multiplication by sign(a) under a linear transformation
ax + b of the random variable; all the skewness measures below satisfy this
requirement. In some of the measures in Theorem 2.3, the mean and the quantile
average $F^{-1}(1 - u) + F^{-1}(u) - 2m_F$ are generalised to the symmetrically
weighted quantile averages defined by

(2.14)   $\mu_K(F) = \int_0^1 F^{-1}(u)\,dK(u) = \int_0^{1/2} [F^{-1}(1 - u) + F^{-1}(u)]\,dK(u)$,

where K(u) is a distribution function on (0, 1) symmetric around $\frac{1}{2}$.


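THEOREM 2.3.
(a) $F \le_\gamma^m G \Leftrightarrow \gamma_u(F) \le \gamma_u(G)$ for $u \in (0, \frac{1}{2})$;
(b)(i) $F \le_2^m G \Rightarrow (\mu_K(F) - m_F) f(m_F) \le (\mu_K(G) - m_G) g(m_G)$,
   (ii) $F \le_2^m G \Rightarrow E(X - m_F)^{2k+1} f^{2k+1}(m_F) \le E(Y - m_G)^{2k+1} g^{2k+1}(m_G)$, $k = 1, 2, \ldots$,
   (iii) $F \le_2^m G \Rightarrow (\mu_F - m_F)/E|X - m_F| \le (\mu_G - m_G)/E|Y - m_G|$,
where $X, Y \sim F, G$.
(c) $F \le_{\mathrm{star}}^m G \Rightarrow$
    $[F^{-1}(1 - u) + F^{-1}(u) - F^{-1}(1 - \alpha) - F^{-1}(\alpha)] / [F^{-1}(1 - \alpha) - F^{-1}(\alpha)]$
    $\le [G^{-1}(1 - u) + G^{-1}(u) - G^{-1}(1 - \alpha) - G^{-1}(\alpha)] / [G^{-1}(1 - \alpha) - G^{-1}(\alpha)]$
for $0 \le u \le \alpha < \frac{1}{2}$.

PROOF. (a) [This result is stated in Groeneveld and Meeden (1984).]
Without loss of generality $m_F$ and $m_G$ may be assumed to be 0. For $u \in (0, \frac{1}{2})$,
(2.12) $\Leftrightarrow$

(2.15)   $G^{-1}(u)F^{-1}(1 - u) - G^{-1}(1 - u)F^{-1}(u) \ge G^{-1}(1 - u)F^{-1}(u) - G^{-1}(u)F^{-1}(1 - u)$.

Adding $G^{-1}(1 - u)F^{-1}(1 - u) - G^{-1}(u)F^{-1}(u)$ to each side of (2.15) gives
    $[G^{-1}(1 - u) + G^{-1}(u)][F^{-1}(1 - u) - F^{-1}(u)] \ge [F^{-1}(1 - u) + F^{-1}(u)][G^{-1}(1 - u) - G^{-1}(u)]$,
and conversely, as required.

(b)(i) follows directly from (2.10) and the definition of $\mu_K(\cdot)$.

(ii) follows from (2.9) written as $F(m_F + x/f(m_F)) \ge G(m_G + x/g(m_G))$.

(iii) follows from Groeneveld and Meeden (1984) since
    $F \le_2^m G \Rightarrow [G^{-1}(1 - u_1) - m_G]/[F^{-1}(1 - u_1) - m_F] \ge f(m_F)/g(m_G) \ge [G^{-1}(u_2) - m_G]/[F^{-1}(u_2) - m_F]$ for $0 < u_1, u_2 < \frac{1}{2}$.

(c) $F \le_{\mathrm{star}}^m G \Rightarrow$
    $f(F^{-1}(1 - v))/g(G^{-1}(1 - v)) \ge [G^{-1}(1 - \alpha) - m_G]/[F^{-1}(1 - \alpha) - m_F]$
    $\ge [G^{-1}(\alpha) - m_G]/[F^{-1}(\alpha) - m_F] \ge f(F^{-1}(v))/g(G^{-1}(v))$ for $0 < v < \alpha < \frac{1}{2}$.
Hence,

(2.16)   $[1/g(G^{-1}(1 - v)) - 1/g(G^{-1}(v))][F^{-1}(1 - \alpha) - F^{-1}(\alpha)] \ge [1/f(F^{-1}(1 - v)) - 1/f(F^{-1}(v))][G^{-1}(1 - \alpha) - G^{-1}(\alpha)]$.

Integrating (2.16) with respect to v from u to $\alpha$ gives the required result. □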
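Part (a) can be illustrated numerically. The sketch below is an addition made here and assumes that the quantile measure has the form $\gamma_u(F) = [F^{-1}(1 - u) + F^{-1}(u) - 2m_F]/[F^{-1}(1 - u) - F^{-1}(u)]$, which is the form implied by the proof of (a). The two distributions (a Weibull with shape 2 as F and the unit exponential as G, both with unit scale) are chosen because $G^{-1}(F(x)) = x^2$ is convex, so $F \le_2 G$; the function names are hypothetical.

```python
import math


def weibull_quantile(u, c):
    """F^{-1}(u) for F(x) = 1 - exp(-x**c), x > 0 (unit scale assumed)."""
    return (-math.log1p(-u)) ** (1.0 / c)


def gamma_u(u, quantile):
    """Quantile skewness measure for 0 < u < 1/2 (assumed form, see lead-in)."""
    lower, upper, median = quantile(u), quantile(1.0 - u), quantile(0.5)
    return (upper + lower - 2.0 * median) / (upper - lower)


grid = [k / 100 for k in range(1, 50)]

# F: Weibull with c = 2.  G: Weibull with c = 1, i.e. the unit exponential.
gamma_f = [gamma_u(u, lambda p: weibull_quantile(p, 2.0)) for u in grid]
gamma_g = [gamma_u(u, lambda p: weibull_quantile(p, 1.0)) for u in grid]

# The ordering F <=_2 G should be reflected in gamma_u(F) <= gamma_u(G).
order_preserved = all(a <= b for a, b in zip(gamma_f, gamma_g))

# |gamma_u| cannot exceed 1, since F^{-1}(u) <= m_F <= F^{-1}(1 - u).
bounded_by_one = all(abs(v) <= 1.0 for v in gamma_f + gamma_g)
```

The inter-quantile range in the denominator plays the role of the scale measure, so neither distribution needs to be standardised before the comparison.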


REMARK 2.1. Three different measures of scale are incorporated in the
skewness measures in Theorem 2.3, namely $1/f(m_F)$, $F^{-1}(1 - u) - F^{-1}(u)$, and
$E|X - m_F|$. Each of these preserves the spread-ordering due to Bickel and
Lehmann (1976) which is used by Oja (1981) and is defined by

(2.17)   $F \le_1 G$ iff $G^{-1}(F(x)) - x$ is nondecreasing,
         i.e. $\Leftrightarrow G^{-1}(v) - G^{-1}(u) \ge F^{-1}(v) - F^{-1}(u)$, $0 < u < v < 1$.

They also satisfy the other requirement of these authors for a scale measure in
that under a linear transformation ax + b of a random variable, the scale
measure is changed only by multiplication by |a|. Oja (1981) weakens $\le_1$ with
respect to the mean to give a scale ordering $\le_1^*$ which is still preserved by the
variance; this is the scale analogy to Section 2.1 above. Similarly to Definition 2.4
above, $\le_1$ may be weakened with respect to the median to give scale with
respect to the median, defined by


DEFINITION 2.6. $F \le_1^m G$ iff

(2.18)   $G^{-1}(F(x)) - x \;\{\le,\ \ge\}\; m_G - m_F$ for $x \;\{\le,\ \ge\}\; m_F$ in $I_F$,

or, equivalently,

(2.19)   $G^{-1}(u) - F^{-1}(u) \;\{\le,\ \ge\}\; m_G - m_F$ for $u \;\{\le,\ \ge\}\; \frac{1}{2}$.


The three scale measures appearing in Theorem 2.3 preserve $\le_1^m$.




REMARK 2.2. Although $1/f(m_F)$ arises naturally in this context as a scale
measure and has been used as such in central-density scaling [see, for example,
Rogers and Tukey (1972) and Rosenberger and Gasko (1983)], it is perhaps not as
appealing as $F^{-1}(1 - u) - F^{-1}(u)$ or $E|X - m|$. For example, both
$|(\mu - m)/E|X - m||$ and $|\gamma_u(\cdot)|$ are bounded by 1, with $|\gamma_u(F)| \to 1$ as $u \to 0$ at least
for semi-infinite $I_F$. However $1/f(m_F)$ is essentially the scale measure that arises
in the orderings as $\le_2$ is gradually weakened. This is illustrated by considering
two distribution functions F and G with medians zero and the same expected
absolute deviation from the median. If $F <_2 G$, $S^-(F - G) = 2$ so that an
ordering on F and G using this scale measure would give inconsistencies with the
criteria of Section 1.4.


REMARK 2.3. Oja's (1981) $\le_2^{**}$ ordering is a weakening of $\le_2$ as in Section
2.1 but $G^{-1}(F(x))$ is compared with some arbitrary line ax + b, thus involving
particular but arbitrary measures of location and scale. Similarly, $\le_2$ may be
weakened as in this section but with respect to some quantile other than the
median. To date neither of these concepts appears to have been discussed in a
skewness context; however the second concept is of importance in the kurtosis
context [Balanda and MacGillivray (1986)].


2.3. Central and tail skewness. There is still the problem of what to say
about the skewness of F when it does not satisfy any of the orderings given so far
with respect to $\bar F$. One solution is to identify a portion of the distribution centred
on the median, that is, some proportion $(1 - 2\alpha)$ of the central part of the
distribution, where F and $\bar F$ may be compared according to one or more of the
orderings of Theorem 2.2. Doksum (1975) also suggested restricting attention to
such a central portion for some fixed $\alpha$, from the point of view of parameter
robustness. The structure of Theorem 2.2 and measures such as $\gamma_u(\cdot)$ may all be
applied to some central portion only, with notation such as $\le_{\mathrm{star}}^{m,c}$, $\le_D^{m,c}$, and
$\le_2^{m,c}$, and the skewness called central skewness.

For many distributions, the central skewness may cover a sufficiently wide
proportion $1 - 2\alpha$ of the distribution for practical purposes. When this is not the
case, a further description of the skewness properties may be required, in terms of
changes between skewness to the left or right according to some ordering.
Doksum (1975) suggested examining changes between increasing and decreasing
of the symmetry function $\theta_F(x)$, thus referring to the $\le_D^m$ ordering. However in
portions of the distribution that do not include the median, the orderings of
Theorem 2.2 do not necessarily form a hierarchy. Since a weakening of this
structure is required, it may be more appropriate and easier to consider changes
of sign of $g(m_G)[G^{-1}(F(x)) - m_G] - f(m_F)(x - m_F)$, or, equivalently, and more
conveniently, $g(m_G)(G^{-1}(u) - m_G) - f(m_F)(F^{-1}(u) - m_F)$.


DEFINITION 2.7. G is at least as skew to the right with respect to the median
as F in the proportion interval $(1 - 2\alpha_2, 1 - 2\alpha_1)$ if (2.10) holds for u and 1 - u
with $u \in [\alpha_1, \alpha_2]$.


COROLLARY 2.1. If Definition 2.7 holds, then for $u \in [\alpha_1, \alpha_2]$,

(2.20)   $\gamma_u(F) \le \gamma_u(G)$ and $\nu_u(F) \le \nu_u(G)$,

where $\nu_u(F) = (F^{-1}(1 - u) + F^{-1}(u) - 2m_F) f(m_F)$.


In the special case $G = \bar F$, (2.20) is necessary and sufficient.

The definition has been given in terms of $\le_2^m$ rather than $\le_\gamma^m$ to allow for
central-density scaling.

Central skewness refers to a proportion interval $[0, 1 - 2\alpha]$; when the proportion
interval is $[1 - 2\alpha, 1]$, F and G are being compared with respect to skewness
in the tails. Central and tail skewness tend to be more easily established than
skewness in other proportion intervals. For example, all distributions with $I_F =
[0, \infty)$ are at least tail-skew to the right. The asymmetric Tukey lambda family
[Ramberg et al. (1979)] is defined by

(2.21)   $F^{-1}(u) = \lambda_1 + [u^{\lambda_3} - (1 - u)^{\lambda_4}]/\lambda_2$,  $0 \le u \le 1$,


and exhibits a variety of skewness behaviour, but all members with A, > (<)A, 
are at least tail-skew to the left (right) [MacGillivray (1982)]. 


DEFINITION 2.8. $F \le_2^{m,c} G$ if Definition 2.7 holds for the proportion interval
$[0, 1 - 2\alpha]$ for some $\alpha$.

$F \le_2^{m,t} G$ if Definition 2.7 holds for the proportion interval $[1 - 2\alpha, 1]$ for
some $\alpha$.


When $\gamma_u(F)$ has only one change of sign for $u \in [0, \frac{1}{2}]$, central skewness and
tail skewness are all that is required to describe the skewness properties of F. The
relative importance of the central and tail skewness depends on the value of u at
which the change-over occurs as well as the context of consideration. Figure 1
presents the hierarchy of skewness orderings.


2.4. Measures of asymmetry. Doksum (1975) defines an index of asymmetry
for the central $100(1 - 2\alpha)\%$ of the distribution by

(2.22)   $\bigl[\sup_{\alpha \le u \le 1/2} \theta_F(F^{-1}(u)) - \inf_{\alpha \le u \le 1/2} \theta_F(F^{-1}(u))\bigr] / \sigma_F$,

where $\sigma_F$ is the standard deviation or some other measure of spread, and where
$\theta_F(F^{-1}(u)) = \frac{1}{2}[F^{-1}(u) + F^{-1}(1 - u)]$.
From Section 2.2, it follows that it is more appropriate to consider either
central-density or inter-quantile scaling, thus using instead of (2.22), either

(2.23)   $\sup_{\alpha \le u \le 1/2} \nu_u(F) - \inf_{\alpha \le u \le 1/2} \nu_u(F)$

or

(2.24)   $\sup_{\alpha \le u \le 1/2} \gamma_u(F) - \inf_{\alpha \le u \le 1/2} \gamma_u(F)$.


The class of distributions that are skew to the right according to $\le_2^m$ is now
denoted by $\mathcal{F}_R$ and the class of distributions skew to the left according to $\le_2^m$
by $\mathcal{F}_L$.

FIG. 1. Partial orderings of F and G with respect to skewness.


Within $\mathcal{F}_R$, $\inf_{\alpha \le u \le 1/2} \nu_u(F) = 0 = \inf_{\alpha \le u \le 1/2} \gamma_u(F)$, and hence (2.23)
and (2.24) preserve $\le_2^m$ for any $\alpha < \frac{1}{2}$ (and hence all the preceding orderings in
the hierarchy of Theorem 2.2). For the class $\mathcal{F}_R$, Doksum calls the index of
asymmetry an index of skewness for the central $100(1 - 2\alpha)\%$. Within $\mathcal{F}_L$,
$\sup_{\alpha \le u \le 1/2} \nu_u(F) = 0 = \sup_{\alpha \le u \le 1/2} \gamma_u(F)$, so that -(2.23) and -(2.24) preserve
$\le_2^m$ for any $\alpha < \frac{1}{2}$. Of course, these comments may be generalised to the wider
classes of distributions that are centrally skew to the right or left for some
fixed $\alpha$.

Hence for distributions that are skew to the right or to the left for their whole
domain, the proposed measures of asymmetry (2.23) and (2.24) are measures of
the amount of skewness regardless of whether it is to the right or left. This seems
to be a reasonable interpretation of the concept of asymmetry and it can be
extended to comparing distributions that do not belong to either $\mathcal{F}_R$ or $\mathcal{F}_L$. For
example, if $F \in \mathcal{F}_R$ and $G \in \mathcal{F}_L$ satisfy $F \le_2^m \bar G$, then $\sup|\nu_u(F)| \le \sup|\nu_u(G)|$
and $\sup|\gamma_u(F)| \le \sup|\gamma_u(G)|$. G is more skew to the left than F is to the right,
and it is reasonable to say that G is more asymmetric than F. Continuing this
concept to cover distributions whose skewness may change direction within their
domain, leads to a definition of an ordering between distributions with respect to
asymmetry.


DEFINITION 2.9. F is skew to the right (left) in the proportion interval
$(1 - 2\alpha_2, 1 - 2\alpha_1)$ if F is at least as skew to the right (left) with respect to the
median as $\bar F$ in this interval.


DEFINITION 2.10. Suppose that [0, 1] is partitioned into intervals so that in
each interval F is skew to the right or left and G is skew to the right or left.
Let

    $F_r = F$ where F is skew to the right,
    $F_r = \bar F$ where F is skew to the left.

Similarly for $G_r$.

Then G is at least as asymmetric as F, $F \le_a G$, if, in each interval of the
partition, $G_r$ is at least as skew to the right as $F_r$.

Similarly G is at least as asymmetric as F in the central $100(1 - 2\alpha)\%$ of the
distribution, $F \le_{a,\alpha} G$, if instead of [0, 1], only the proportion interval $[0, 1 - 2\alpha]$
is considered.


THEOREM 2.4. $|\nu_u(\cdot)|$, $|\gamma_u(\cdot)|$, and their suprema over $u \in (0, \frac{1}{2})$ preserve
$\le_a$. Similarly $|\nu_u(\cdot)|$, $|\gamma_u(\cdot)|$, and their suprema over $u \in (\alpha, \frac{1}{2})$ preserve $\le_{a,\alpha}$.


PROOF. Follows immediately. □


Hence $\sup_{\alpha \le u \le 1/2}|\nu_u(\cdot)|$ and $\sup_{\alpha \le u \le 1/2}|\gamma_u(\cdot)|$ are suggested as measures of
asymmetry instead of (2.23) and (2.24), coinciding with them for distributions
belonging to $\mathcal{F}_R$ or $\mathcal{F}_L$.
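The difference between the range-type index (2.23) and the supremum of $|\nu_u|$ can be seen in a small numerical sketch, added here for illustration. It assumes the unit-scale Weibull family $F(x) = 1 - \exp(-x^c)$ with two hypothetical shape values: c = 1 (the exponential, skew to the right throughout) and c = 4 (whose skewness changes direction). Suprema and infima are approximated by maxima and minima over a finite grid chosen here, so the numbers are grid approximations.

```python
import math


def weibull_quantile(u, c):
    """F^{-1}(u) for F(x) = 1 - exp(-x**c), x > 0 (unit scale assumed)."""
    return (-math.log1p(-u)) ** (1.0 / c)


def weibull_density(x, c):
    return c * x ** (c - 1.0) * math.exp(-(x ** c))


def nu_u(u, c):
    """nu_u(F) = (F^{-1}(1 - u) + F^{-1}(u) - 2 m_F) * f(m_F)."""
    median = weibull_quantile(0.5, c)
    total = weibull_quantile(1.0 - u, c) + weibull_quantile(u, c)
    return (total - 2.0 * median) * weibull_density(median, c)


def asymmetry_indices(c, grid):
    """Return (max - min, max of absolute values) of nu_u over the grid."""
    values = [nu_u(u, c) for u in grid]
    return max(values) - min(values), max(abs(v) for v in values)


# Grid reaching far into the tails and ending at u = 1/2.
grid = [1e-6, 1e-5, 1e-4, 1e-3] + [k / 100 for k in range(1, 51)]

range_exp, sup_abs_exp = asymmetry_indices(1.0, grid)
range_w4, sup_abs_w4 = asymmetry_indices(4.0, grid)
```

For the exponential the two indices agree, because $\nu_u$ is nonnegative and vanishes at $u = \frac{1}{2}$. For c = 4 the values of $\nu_u$ take both signs on the grid, so the range exceeds the largest absolute value.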

The quantities $\sup_u|\nu_u(\cdot)|$ and $\sup_u|\gamma_u(\cdot)|$ also have further justification as
measures of asymmetry. Of all the distributions symmetric about $m_F$, H(x),
defined by $H^{-1}(u) = \frac{1}{2}[F^{-1}(u) - F^{-1}(1 - u)] + m_F$, best approximates F(x) in
the sense that $\sup_u|F^{-1}(u) - H^{-1}(u)|$ is a minimum [Doksum (1975)]. Since
$|F^{-1}(u) - H^{-1}(u)| = \frac{1}{2}|F^{-1}(u) + F^{-1}(1 - u) - 2m_F|$, this is the distance the
uth quantile has to move to become the uth quantile of the "closest" symmetric
distribution. H(x) is also the only distribution symmetric about $m_F$ with the
same inter-quantile scale measure, namely $F^{-1}(1 - u) - F^{-1}(u)$ for all u. Thus,
as well as preserving a reasonable ordering with respect to asymmetry, $\sup|\nu_u(F)|$
and $\sup|\gamma_u(F)|$ give a minimum standardised distance between F and a distribution
symmetric about $m_F$.
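The properties claimed for the symmetrised distribution can be checked directly from its quantile function. The sketch below is an added illustration; it assumes the unit exponential as F and takes $H^{-1}(u) = \frac{1}{2}[F^{-1}(u) - F^{-1}(1 - u)] + m_F$, the form consistent with the distance identity in the text. The function names and the grid are hypothetical.

```python
import math


def exp_quantile(u):
    """F^{-1}(u) for the unit exponential (assumed example)."""
    return -math.log1p(-u)


def symmetrised_quantile(u, quantile):
    """H^{-1}(u) = (1/2) * [F^{-1}(u) - F^{-1}(1 - u)] + m_F."""
    return 0.5 * (quantile(u) - quantile(1.0 - u)) + quantile(0.5)


median = exp_quantile(0.5)
grid = [k / 100 for k in range(1, 100)]


def h(u):
    return symmetrised_quantile(u, exp_quantile)


# H is symmetric about m_F: H^{-1}(u) + H^{-1}(1 - u) = 2 m_F.
symmetric = all(abs(h(u) + h(1.0 - u) - 2.0 * median) < 1e-12 for u in grid)

# H has the same inter-quantile ranges as F.
same_ranges = all(
    abs((h(1.0 - u) - h(u)) - (exp_quantile(1.0 - u) - exp_quantile(u))) < 1e-12
    for u in grid
)

# Distance moved by the u-th quantile equals half the quantile-average deviation.
distance_identity = all(
    abs(
        abs(exp_quantile(u) - h(u))
        - 0.5 * abs(exp_quantile(u) + exp_quantile(1.0 - u) - 2.0 * median)
    )
    < 1e-12
    for u in grid
)
```

Dividing the distance by $F^{-1}(1 - u) - F^{-1}(u)$ gives $\frac{1}{2}|\gamma_u(F)|$ under the form of $\gamma_u$ used in Theorem 2.3(a), which is the sense in which the measure is a standardised distance.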

Figure 2 shows y„(F ) for some members of the Ramberg et al. (1979) asymmet- 
ric Tukey lambda family with A, = 2A,. Let F correspond to A, = 2.5 and G to 
\, = 38. Then for the interval 0.1 <u < 0.5, suply,(F)| = suply,(G)| but 
sup y,(F) — inf y,(F) < sup y,(G) — inf y,(G). Essentially a change in the di- 
rection of the skewness within the domain of a distribution does not necessarily 
increase the overall asymmetry—it may help to decrease it. The graph for 
à, = 0.2 is also included to show that the family, at least for Az = 2A,4, is not 
ordered according to any of the orderings of Theorem 2.2. 


3. Skewness properties of some distribution families. The relationship
of F and $\bar F$ is examined in this section for some important families of
distributions, also illustrating some points of interest from Section 2. All the results
have been obtained analytically but because the study of sign changes can
sometimes be laborious, no proofs are given here. Functions such as $\bar F^{-1}(F(x))$,
$F^{-1}(1 - u) + F^{-1}(u) - 2m_F$, or $F(\zeta + x) + F(\zeta - ax) - 1$ may be considered,
and sometimes may all need to be considered.


FIG. 2. $\gamma_u$ for members of an asymmetric Tukey lambda family.


3.1. Weibull distribution. This has $F(x) = 1 - e^{-\theta x^c}$, $x > 0$; $c > 0$, $\theta > 0$.

Hence $F^{-1}(u) = [\theta^{-1} \log(1 - u)^{-1}]^{1/c}$, and it can easily be shown that $F_{c_1} \le_2 F_{c_2}$
for $c_1 > c_2$. That is, as c increases, the distribution becomes less skew to the
right. Now it is known that $\mu_3$ changes sign from positive to negative values at
c = 3.6 [Johnson and Kotz (1970), page 253] and examination of m - M shows
that this changes from positive to negative values at $c = (1 - \log 2)^{-1} = 3.26$.
Since $I_F = (0, \infty)$, the distribution is always at least tail-skew to the right,
but the changes in $\mu_3$ and m - M show that $F \in \mathcal{F}_R$ does not hold for every c.
Detailed examination shows that for $c \le 1/(1 - \log 2)$, F is strongly skew to the
right, and for $c > 1/(1 - \log 2)$, it is centrally skew to the left and tail-skew to the
right. It can also be shown that for $c > 1/(1 - \log 2)$, $F^{-1}(1 - u) + F^{-1}(u) - 2m_F$
not only has one change of sign but also only one turning point.
The strong result may be shown by considering the second derivative of
$\bar F^{-1}(F(x))$ at $x = F^{-1}(u)$. The weaker results may be shown by considering
derivatives of $F^{-1}(u) + F^{-1}(1 - u) - 2m_F$, and, by considering $\log f(\zeta + x) -
\log f(\zeta - x)$, it may also be shown that $S^-[F^{-1}(u) + F^{-1}(1 - u) - 2m_F] = 1$,
from positive to negative values for $u \in (0, \frac{1}{2})$.
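The sign pattern described for the Weibull family can be explored numerically. The following sketch is an added illustration that assumes $\theta = 1$ and two hypothetical shape values, c = 2 (below the critical value) and c = 4 (above it). It counts sign changes of $F^{-1}(1 - u) + F^{-1}(u) - 2m_F$ over a logarithmic grid in u; the grid and the helper names are choices made here, and a grid count is only indicative of the analytical result.

```python
import math


def lower_quantile(u, c):
    """F^{-1}(u) for F(x) = 1 - exp(-x**c)."""
    return (-math.log1p(-u)) ** (1.0 / c)


def upper_quantile(u, c):
    """F^{-1}(1 - u), computed from log(u) to stay accurate for tiny u."""
    return (-math.log(u)) ** (1.0 / c)


def skewness_numerator(u, c):
    """F^{-1}(1 - u) + F^{-1}(u) - 2 m_F, with m_F = (log 2)**(1/c)."""
    median = math.log(2.0) ** (1.0 / c)
    return upper_quantile(u, c) + lower_quantile(u, c) - 2.0 * median


def sign_changes(values):
    """Number of sign changes in a sequence, ignoring exact zeros."""
    signs = [v > 0 for v in values if v != 0.0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


critical_c = 1.0 / (1.0 - math.log(2.0))                  # about 3.26

# u from 1e-8 (far tail) up to about 0.4 (near the median).
grid = [10.0 ** (-8.0 + 0.1 * k) for k in range(77)]

changes_c2 = sign_changes([skewness_numerator(u, 2.0) for u in grid])
changes_c4 = sign_changes([skewness_numerator(u, 4.0) for u in grid])
```

For c = 4 the crossing on this grid lies at a very small value of u, between $10^{-4}$ and $10^{-3}$, which is one way of seeing that the tail skewness to the right affects only a tiny proportion of the distribution.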

Figures 3 and 4 show that for most of $I_F$, the family virtually behaves as a
strongly ordered family that includes a symmetric distribution. The tail skewness
has practically no influence except on global measures such as $\mu_3$ and $\mu - m$ for c
between 3 and 4.








FIG. 3. Weibull density functions for c = 2, 3.26, 8.


FIG. 4. $\gamma_u$ for members of the Weibull family.


3.2. Johnson system [Johnson (1949)]. For this system $F(x) = \Phi(b(x))$
where $\Phi$ is the distribution function of the standard normal and b(x) is an
increasing function. For skewness properties, $\Phi(z)$ may in fact be any distribution
symmetric about zero; the properties are determined by the transformation
b(x). Johnson identified three main systems called $S_L$, $S_U$, and $S_B$, corresponding
to three general forms of b(x).

For the $S_L$ system, $b(x) = \delta \log x + \gamma$, $x > 0$. For $\delta > 0$, this is concave and F
is strongly skew to the right; conversely for $\delta < 0$ and similarly for any concave
or convex b(x).
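For the $S_L$ system with $\delta > 0$, strong skewness to the right can be checked through condition (2.13). The sketch below is an added illustration: it inverts $F(x) = \Phi(\delta \log x + \gamma)$ to obtain $F^{-1}(u) = \exp[(\Phi^{-1}(u) - \gamma)/\delta]$ and tests monotonicity of the folded quantile sum on a grid. The parameter values are hypothetical, and the standard library class `statistics.NormalDist` supplies $\Phi^{-1}$.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def sl_quantile(u, delta, gamma):
    """F^{-1}(u) when F(x) = Phi(delta * log(x) + gamma), with delta > 0."""
    return math.exp((STANDARD_NORMAL.inv_cdf(u) - gamma) / delta)


def folded_quantile_sum(t, delta, gamma):
    """F^{-1}(1/2 + t) + F^{-1}(1/2 - t), the quantity appearing in (2.13)."""
    return sl_quantile(0.5 + t, delta, gamma) + sl_quantile(0.5 - t, delta, gamma)


delta, gamma = 1.5, 0.3                      # hypothetical parameter values
ts = [k / 100 for k in range(0, 50)]         # t in [0, 0.49]
sums = [folded_quantile_sum(t, delta, gamma) for t in ts]

# Condition (2.13): the folded sum is nondecreasing in t.
strongly_right_skew = all(a <= b for a, b in zip(sums, sums[1:]))
```

In this case the folded sum equals $2e^{-\gamma/\delta}\cosh(z/\delta)$ with $z = \Phi^{-1}(\frac{1}{2} + t)$, so its monotonicity does not depend on $\gamma$ or on $\Phi$ being normal, in line with the remark that $\Phi$ may be any distribution symmetric about zero.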

For the S, system b(x) = dsinh~'x + y, —0oo <x < oo, and for the Sp 
system b(x) = ô log[x/(1 — x)] + y, 0 < x < 1. Without loss of generality, con- 
sider § > 0. For y < 0, Sy bale PP ani F <# F while S, bas F <7” F and 


F <£ F, conversely for y > 0. For these two systems, there is an interesting point 

when considering $F(\zeta + x) + F(\zeta - ax) - 1$. In both cases this has no more than
two changes of sign for any $\zeta$ and a, but when there are two changes of sign the
sequence of signs depends on $\zeta$ and a so that F and $\bar F$ cannot be ordered by $\le_2$.
The Pearson system also has the property that

    $S^-(F(\zeta + x) + F(\zeta - ax) - 1) \le 2$,

but again, for individual distributions it is necessary to check that the sequence
of sign changes in the case of equality does not change with $\zeta$ and a.


4. Conclusion. This paper has aimed to bring together into one structure 
the variety of views of skewness that have been previously considered, filling in 
any gaps. The structure so obtained is intended to provide a background for 




reference purposes as obviously not all the orderings and measures will be 
required for any one situation. 

Yule (1911, page 162) described the quantile measure $\gamma_{1/4}(\cdot)$ as a "rather
rough-and-ready" measure of skewness. Although the relative importance of the
different orderings and measures depends on circumstances, and it is unlikely
that any one could be described as most important, it appears that the general
quantile measures $\gamma_u(\cdot)$ play a valuable role in discussing both skewness and
asymmetry.

A particular point that has emerged in the course of the paper is that 
describing the skewness of individual distributions is not only a special case of 
comparing different distributions, but also may be a necessary inclusion in a full 
description of the comparative skewness properties of a family of distributions. 


Acknowledgments. The author would like to thank Mr. John Zornig for his 
assistance in preparing the diagrams, and the referees for their helpful comments. 


REFERENCES 


BALANDA, K. P. and MACGILLIVRAY, H. L. (1986). Kurtosis and spread. Submitted for publication.

BARLOW, R. E. and PROSCHAN, F. (1966). Inequalities for linear combinations of order statistics from
restricted families. Ann. Math. Statist. 37 1574-1592.

BICKEL, P. J. and LEHMANN, E. L. (1975). Descriptive statistics for non-parametric models. I.
Introduction. II. Location. Ann. Statist. 3 1038-1069.

BICKEL, P. J. and LEHMANN, E. L. (1976). Descriptive statistics for non-parametric models. III.
Dispersion. Ann. Statist. 4 1139-1168.

BOWLEY, A. L. (1901). Elements of Statistics. Staples Press Ltd., London. (6th ed., 1937).

DAVID, F. N. and JOHNSON, N. L. (1954). A test for skewness with ordered variables. Ann. Eugenics
Lond. 18 351-353.

DAVID, F. N. and JOHNSON, N. L. (1956). Some tests of significance with ordered variables. J. Roy.
Statist. Soc. Ser. B 18 1-20.

DAVID, H. A. and GROENEVELD, R. A. (1982). Measures of local variation in a distribution: Expected
length of spacings and variances of order statistics. Biometrika 69 227-232.

DOKSUM, K. A. (1969). Starshaped transformations and the power of rank tests. Ann. Math. Statist.
40 1167-1176.

DOKSUM, K. A. (1975). Measures of location and asymmetry. Scand. J. Statist. 2 11-22.

DOODSON, A. T. (1917). Relation of the mode, median and mean in frequency curves. Biometrika 11
425-429.

FECHNER, G. TH. (1897). Kollektivmasslehre. Engelmann, Leipzig.

GARVER, R. (1932). Concerning the limits of a measure of skewness. Ann. Math. Statist. 3 358-360.

GROENEVELD, R. A. and MEEDEN, G. (1977). The mode, median and mean inequality. Amer. Statist.
31 120-121.

GROENEVELD, R. A. and MEEDEN, G. (1984). Measuring skewness and kurtosis. The Statistician 33
391-399.

HALDANE, J. B. S. (1942). The mode and median of a nearly normal distribution with given
cumulants. Biometrika 32 294-299.

HINKLEY, D. V. (1975). On power transformations to symmetry. Biometrika 62 101-111.

HOGG, R. V. (1974). Adaptive robust procedures: A partial review and some suggestions for future
applications and theory. J. Amer. Statist. Assoc. 69 909-923.

HOGG, R. V., FISHER, D. M. and RANDLES, R. H. (1975). A two-sample adaptive distribution-free
test. J. Amer. Statist. Assoc. 70 656-661.

HOTELLING, H. and SOLOMONS, L. M. (1932). The limits of a measure of skewness. Ann. Math.
Statist. 3 141-142.

JOHNSON, N. L. (1949). Systems of frequency curves generated by methods of translation.
Biometrika 36 149-176.

JOHNSON, N. L. and KOTZ, S. (1970). Continuous Univariate Distributions 1. Houghton-Mifflin,
Boston.

KARLIN, S. (1968). Total Positivity 1. Stanford Univ. Press, Stanford, Calif.

LAWRENCE, M. J. (1975). Inequalities of s-ordered distributions. Ann. Statist. 3 413-428.

MACGILLIVRAY, H. L. (1979). Moment inequalities with applications to particle size distributions.
Unpublished Ph.D. thesis, Univ. Queensland.

MACGILLIVRAY, H. L. (1981). The mean, median, mode inequality and skewness for a class of
densities. Austral. J. Statist. 23 247-250.

MACGILLIVRAY, H. L. (1982). Skewness properties of asymmetric forms of Tukey lambda
distributions. Comm. Statist. A—Theory Methods 11 2239-2248.

MACGILLIVRAY, H. L. (1985). A crossing theorem for distributions and their moments. Bull. Austral.
Math. Soc. 31 413-419.

MAJINDAR, K. N. (1962). Improved bounds on a measure of skewness. Ann. Math. Statist. 33
1192-1194.

MANN, H. B. and WHITNEY, D. R. (1947). On a test of whether one of two random variables is
stochastically larger than the other. Ann. Math. Statist. 18 50-60.

OJA, H. (1981). On location, scale, skewness and kurtosis of univariate distributions. Scand. J.
Statist. 8 154-168.

ORD, J. K. (1968). The discrete Student's t distribution. Ann. Math. Statist. 39 1513-1516.

PEARSON, K. (1895). Contributions to the mathematical theory of evolution II. Skew variation in
homogeneous material. Philos. Trans. Roy. Soc. London Ser. A 186 343.

RAMBERG, J. S., TADIKAMALLA, P. R., DUDEWICZ, E. J. and MYKYTKA, E. F. (1979). A probability
distribution and its uses in fitting data. Technometrics 21 201-214.

ROGERS, W. H. and TUKEY, J. W. (1972). Understanding some long-tailed symmetrical distributions.
Statist. Neerlandica 26 211-226.

ROSENBERGER, J. L. and GASKO, M. (1983). Comparing location estimators: Trimmed means,
medians and trimean. In Understanding Robust and Exploratory Data Analysis (D. C.
Hoaglin, F. Mosteller and J. W. Tukey, eds.) 297-338. Wiley, New York.

RUNNENBURG, J. TH. (1978). Mean, median, mode. Statist. Neerlandica 32 73-79.

TIMERDING, H. E. (1915). Die Analyse des Zufalls. Braunschweig. (In Leiden University Library.)

VAN ZWET, W. R. (1964). Convex transformations of random variables. Math. Centrum, Amsterdam.

VAN ZWET, W. R. (1979). Mean, median, mode II. Statist. Neerlandica 33 1-5.

YULE, G. U. (1911). Introduction to the Theory of Statistics. Griffin, London. (13th ed., 1944).


DEPARTMENT OF MATHEMATICS
UNIVERSITY OF QUEENSLAND
ST. LUCIA
QUEENSLAND 4067
AUSTRALIA


The Annals of Statistics
1986, Vol. 14, No. 3, 1012-1029 


ON THE ASYMPTOTIC FORMULA FOR THE PROBABILITY OF
A TYPE I ERROR OF MIXTURE TYPE POWER ONE TESTS¹

By MOSHE POLLAK

The Hebrew University of Jerusalem

Let $X_1, X_2, \ldots$ be iid with density $f_\theta$ with respect to a sigma-finite
measure $\mu$, where $\{f_\theta\}_{\theta \in \Omega}$, $\Omega \subseteq \mathbb{R}$, is an exponential family. Let $F$ be a
probability measure on $\Omega$ and let $\theta_0 \in \Omega$. Define

\[ T(B, F) = \min\left\{ n \,\middle|\, \int_\Omega \frac{f_y(X_1) \cdots f_y(X_n)}{f_{\theta_0}(X_1) \cdots f_{\theta_0}(X_n)} \, dF(y) > B \right\}, \]

$T(B, F) = \infty$ if no such $n$ exists. Previous studies have found that if $F$ has a
positive and continuous density with respect to Lebesgue measure on $\Omega$, then

\[ B P_{\theta_0}(T(B, F) < \infty) \xrightarrow[B \to \infty]{} \int_\Omega \int_0^\infty \exp\{-x\} \, dH_\theta(x) \, dF(\theta), \]

where $H_\theta$ are certain measures arising in a renewal-theoretic context.

Here we show that in a nonlattice context, this convergence holds for
general probability measures $F$. We also show that the convergence is uniform
for all probability measures $F$ whose support is contained in an arbitrary
interval $[a, b]$ interior to $\Omega$, if the distribution of $X_1$ is strongly nonlattice
for all $y \in \Omega$.


1. Introduction and summary. Let $\Omega$ be an open interval on the real line
and let $\{f_y\}_{y \in \Omega}$ be the densities of a one-parameter exponential family with
natural state space $\Omega$ with respect to a sigma-finite measure $\mu$. Denote:

\[ f_y(x) = \exp\{yx - \psi(y)\}, \qquad -\infty < x < \infty, \quad y \in \Omega. \]

Without loss of generality, assume that $0 = \psi(0) = \psi'(0)$. Let $X_1, X_2, \ldots$ be a
sequence of iid random variables, let $P_\theta$ be the probability measure (on $\mathbb{R}^\infty$)
under which the $X_i$ have density $f_\theta$ with respect to $\mu$, and let $E_\theta$ denote expectation
under $P_\theta$. Let $F$ be a probability measure over $\Omega$ with $F(\{0\}) = 0$. Denote:

\[ L(n, y) = \prod_{i=1}^{n} \frac{f_y(X_i)}{f_0(X_i)}, \]

\[ L(n, F) = \int_\Omega L(n, y) \, dF(y), \]

\[ T(B, F) = \min\{ n \mid L(n, F) > B \}, \qquad T(B, F) = \infty \text{ if no such } n \text{ exists.} \]
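As an illustration only, the stopping rule $T(B, F)$ can be sketched for one concrete case. The sketch below assumes the $N(\theta, 1)$ family, for which $\psi(y) = y^2/2$, and a mixing measure $F$ with finitely many atoms; the function name `mixture_stopping_time` and its arguments are hypothetical and do not come from the paper.

```python
import math


def mixture_stopping_time(xs, atoms, weights, B):
    """Return the first n with L(n, F) > B, or None if the data run out.

    Assumes the N(theta, 1) family, so psi(y) = y**2 / 2, and a mixing
    measure F that puts mass weights[j] on atoms[j] (all atoms nonzero).
    """
    log_B = math.log(B)
    s = 0.0  # running sum X_1 + ... + X_n
    for n, x in enumerate(xs, start=1):
        s += x
        # log of each summand w_j * exp{y_j * S_n - n * psi(y_j)}
        terms = [math.log(w) + y * s - n * y * y / 2.0
                 for y, w in zip(atoms, weights)]
        # log-sum-exp keeps the computation stable for large n
        top = max(terms)
        log_L = top + math.log(sum(math.exp(t - top) for t in terms))
        if log_L > log_B:
            return n
    return None
```

Working on the log scale avoids overflow in $L(n, F)$ when $n$ is large; a finite data list can only report that no crossing occurred within the observed sample, which is why `None` is returned rather than $\infty$.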


Received October 1983; revised June 1985. 

¹This research was supported by a grant from the United States-Israel Binational Foundation
(BSF), Jerusalem, Israel.

AMS 1980 subject classifications. Primary 62L10; secondary 62F05. 

Key words and phrases. Power one tests, mixture-type stopping rules, strongly nonlattice, renewal 
theory, nonlinear renewal theory. 




The statistical test which stops at $T(B, F)$ and rejects $H_0$: $\theta = 0$ in favor of
$H_1$: $\theta \ne 0$ has power one for certain values of $\theta$ [cf. Robbins (1970)]. It is known
[Lai and Siegmund (1977) and Woodroofe (1982), Section 6.1] that the significance
level of this test is

\[ (1) \qquad P_0(T(B, F) < \infty) = \int_\Omega E_\theta[1/L(T(B, F), F)] \, dF(\theta). \]


Let $S_n^y = y \sum_{i=1}^{n} X_i - n\psi(y)$, let $\tau = \min\{n \mid S_n^y > A\}$, $\tau = \infty$ if no such $n$ exists
and let $\rho_y = S_\tau^y - A$ on $\{\tau < \infty\}$. If the $P_y$-distribution of $yX_1 - \psi(y)$ is
nonlattice, it follows from standard renewal theory [cf. Feller (1971)] that under $P_y$,
$\rho_y$ has (as $A \to \infty$) a limiting distribution $H_y$. If $F$ has a positive continuous
density with respect to Lebesgue measure on $\Omega$ then by Lai and Siegmund (1977)

\[ (2) \qquad B P_0(T(B, F) < \infty) \xrightarrow[B \to \infty]{} \int_\Omega \int_0^\infty \exp\{-x\} \, dH_\theta(x) \, dF(\theta) \]


[see also Woodroofe (1982), Section 6.2]. The method involved in the proof of (2) 
is nonlinear renewal theory [developed by Woodroofe (1976) and Lai and 
Siegmund (1977); see Woodroofe (1982) for a survey]. Formula (2) yields an 
approximation for the significance level of the test associated with T(B, F). The
approximation is remarkably good, even for low values of B [Lai and Siegmund 
(1977)]. 
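The inner integral in (2), $\int_0^\infty \exp\{-x\}\,dH_\theta(x)$, is the limiting value of $E_\theta \exp\{-\rho_\theta\}$ as $A \to \infty$. As a rough sketch only, this quantity can be estimated by simulation for a fixed, moderately large $A$. The code assumes the $N(\theta, 1)$ family with $\psi(\theta) = \theta^2/2$; the function name `overshoot_laplace` and the choice of $A$ and replication count are hypothetical, and a finite $A$ only approximates the limit.

```python
import math
import random


def overshoot_laplace(theta, A, reps, seed=0):
    """Monte Carlo estimate of E_theta exp{-(S_tau - A)} for N(theta, 1) data.

    S_n = theta * (X_1 + ... + X_n) - n * theta**2 / 2 has positive drift
    theta**2 / 2 under P_theta (theta != 0), so the level A is crossed
    with probability one.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        s = 0.0
        while s <= A:  # tau = first n with S_n > A
            x = rng.gauss(theta, 1.0)
            s += theta * x - theta * theta / 2.0
        total += math.exp(-(s - A))  # exp{-overshoot}
    return total / reps
```

Averaging such estimates over draws of $\theta$ from $F$ would give a simulation counterpart of the right side of (2); this is a numerical check under the stated assumptions, not the method of the paper.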

Let $\delta_y$ denote the probability measure degenerate at $y$. For testing $H_0$: $\theta = 0$
against an alternative $H_1$: $\theta = y$ with a power one test, $T(B, \delta_y)$ is optimal in the
sense that it has least $P_y$-expected sample size among all power one tests with
significance level $\alpha \le P_0(T(B, \delta_y) < \infty)$. If the alternative $H_1$ is not simple,
$T(B, \delta_y)$ may not yield a test of power one at every point in the alternative and is
not efficient for values of $\theta$ other than $y$. One can maintain power one and
asymptotic ($B \to \infty$) efficiency at every point in the alternative by employing a
rule $T(B, F)$ with $F$ having a positive continuous density for all points in the
alternative [see Pollak and Siegmund (1975) and Pollak (1978)]. The asymptotic
efficiency of $T(B, F)$ is manifest when $B$ is large. When $B$ is not large $T(B^*, \delta_y)$
will have a significantly smaller $P_y$-expected sample size than $T(B, F)$ [where $B^*$
is such that $T(B^*, \delta_y)$ and $T(B, F)$ yield tests with power one at $\theta = y$ having
the same level of significance]. For reasons of continuity, $E_\theta T(B^*, \delta_y)$ will be
smaller than $E_\theta T(B, F)$ for a sizeable $\theta$-neighborhood of $y$. As for practicality,
the integration involved in computing $L(n, F)$ may make application of $T(B, F)$
cumbersome. Therefore, choosing a measure $F^*$ concentrated at a single point or
having atoms at a few points and employing $T(B, F^*)$ may be much more
appealing than using $T(B, F)$ with continuous $F$. The range of values of $B$ where
this may be the case is large, and seems to include many of the "practical" cases.
[For an indication of this, see Pollak and Siegmund (1985).] While the test
associated with $T(B, F^*)$ may not have power one at all points in the alternative,
averaging $F^*$ with a continuous $F$ will rectify this, at the same time retaining
reasonably good efficiency for most points in the alternative. Therefore, there is
an interest in establishing (2) for a wider range of measures $F$.




Other questions of interest concern the uniformity (in $F$) of the convergence in
(2). These become of importance when several measures are being considered or
when the measure $F$ is random. [This may be the case, for instance, if a machine
requires calibration at a value $\theta = 0$ at the start of each day, a power one test is
daily employed on the products to check whether $\theta = 0$, and the measure $F$,
which represents the values of $\theta$ when $\theta \ne 0$, is daily updated. For another
example of a random $F$, arising in a different context (one where the uniformity
of the convergence is crucial), see Pollak (1983).]

Here we show that in case $yX_1 - \psi(y)$ are nonlattice, the convergence in (2)
exists for general probability measures $F$. We show that this convergence is
uniform for all probability measures $F$ whose support is contained in an arbitrary
interval $[a, b]$ interior to $\Omega$, under the restriction that the $P_y$-distribution of $X_1$
be strongly nonlattice [see Stone (1965)] for all $y \in \Omega$. (This requirement is
fulfilled, for example, in case the observations are normal or exponential. It is not
fulfilled if they are binomial or Poisson.) It should be noted that the uniformity
result may not hold if the strongly nonlattice assumption is not satisfied (e.g., the
Bernoulli case).

Even when the strongly nonlattice assumption is satisfied, the uniformity
result is not transparent. The standard approach of decomposing a nonlinear
renewal process calls for representing $L(n, F)$ via

\[ (3) \qquad \log L(n, F) = \theta \sum_{i=1}^{n} X_i - n\psi(\theta) + \xi(n, \theta, F), \]

where $\xi(n, \theta, F)$ are sequences which are slowly varying. If these are slowly
varying uniformly in $\theta$ and $F$, letting $T = T(B, F)$, this representation would be
applied to

\[ (4) \qquad B P_0(T(B, F) < \infty) = \int_\Omega E_\theta \exp\{-[\log L(T, F) - \log B]\} \, dF(\theta) \]

[which is equivalent to (1)] to yield a uniform convergence in (2). The difficulty is
that the representation (3) may fail, let alone $\xi(n, \theta, F)$ slowly vary uniformly
[e.g., consider the N(6,1) case with F = }8,_, + 48,,.—the representation fails 
for $\theta = y$]. Looking at it differently, to each $B$, $F$ there corresponds a (one-sided
or two-sided) boundary $\gamma(t)$ via the relation

\[ B = \int \exp\{y\gamma(t) - t\psi(y)\} \, dF(y). \]

The stopping time $T(B, F)$ is equal to the first time $n$ that the sequence of
partial sums $\sum_{i=1}^{n} X_i$ crosses the boundary $\gamma(t)$, and one can try to get the
asymptotics of the overshoot $\sum_{i=1}^{T} X_i - \gamma(T)$ to account for uniform convergence
in (2) via smoothness properties of $\gamma(t)$. However, it is generally not the
case that for "neighboring" mixing measures $F_1$, $F_2$ the corresponding boundaries
$\gamma_1(t)$, $\gamma_2(t)$ are close uniformly in $B$. [For instance, consider $F_1 = \delta_a$, $F_2 =
(1 - \varepsilon)\delta_a + \varepsilon\delta_b$ where $0 < a < b$. An easy calculation shows that $\gamma_1(0) =
(\log B)/a$, while $\gamma_2(0) = (\log B - \log \varepsilon)/b + o(1)$.]
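The "easy calculation" for the two-point example can be checked numerically. The following sketch solves $B = \int \exp\{y\gamma(0)\}\,dF(y)$ by bisection for a discrete $F$ with positive atoms; the function name `boundary_at_zero` and the numerical values used with it are hypothetical choices made for illustration.

```python
import math


def boundary_at_zero(B, atoms, weights, iterations=200):
    """Solve B = sum_j w_j * exp(y_j * g) for g by bisection.

    Assumes B > 1, all atoms positive and weights summing to one, so the
    right-hand side is increasing in g and the root lies in [0, hi].
    """
    def mixture(g):
        return sum(w * math.exp(y * g) for y, w in zip(atoms, weights))

    # mixture(0) = 1 < B, and mixture(log(B) / min(atoms)) >= B
    lo, hi = 0.0, math.log(B) / min(atoms)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mixture(mid) < B:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With, say, $a = 0.5$, $b = 2$, $\varepsilon = 0.01$ and $B = 10^6$, the two boundaries at $t = 0$ differ by a factor of about three even though $F_1$ and $F_2$ differ by only $\varepsilon$ in total variation, which is the phenomenon described above.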

Nevertheless, due to the following reasoning, the uniformity result is true. By
virtue of (4) it is enough to show that the representation (3) and the uniformity
in $\theta$ and $F$ of the slowly varying characteristics of $\xi(n, \theta, F)$ hold not for all $\theta$,
but for a $\theta$-set $A_B(F)$ having arbitrarily large $F$-probability (larger, say, than
$1 - \varepsilon$, $\varepsilon$ arbitrary). It does not matter if $A_B(F)$ varies with $B$ or $F$, as long as
$1 - \varepsilon$ remains a lower bound for its probability and the slowly varying
characteristics of $\xi(n, \theta, F)$ continue to hold for $n$ in the vicinity of $T(B, F)$. The rigorous
presentation of this reasoning is the content of the proof supplied in this article.


2. General convergence. We will use the notation of the previous section.

THEOREM 1. Suppose that $F\{0\} = 0$ and $F\{\{\theta \mid \theta X_1 - \psi(\theta)$ has a lattice
$P_\theta$-distribution$\}\} = 0$. Then

\[ B P_0\{T(B, F) < \infty\} \xrightarrow[B \to \infty]{} \int_\Omega \int_0^\infty \exp\{-x\} \, dH_\theta(x) \, dF(\theta). \]

Essentially, the idea of the proof is to apply Lemmas 1 and 5 below and the
nonlinear renewal theorem to (4) above.

Let a < b be interior points of Q. Fix 6>0, 8 = $, p=}, and a=. 
Denote 

T=T(B,F), X,=072,X/n, e=e(n) =n, 


r =K(n, 8) = min{e, 48, 0y WO) }, a = a(n) = nèt, 
$S[a, b]$ = set of all probability measures $F$ whose support is contained
in $[a, b]$, and which satisfy $F\{0\} = 0$ and $F\{\{\theta \mid \theta X_1 - \psi(\theta)$ has
a lattice distribution$\}\} = 0$,
$I(\theta) = \theta\psi'(\theta) - \psi(\theta)$, $\quad l_\theta = (\log B)/I(\theta)$,
m, =the integer value of lọ — (1,)°p/4, 
n, =the integer value of lọ + (1g)"p/4, 
eg =(rrg)P" 72, ef = min{eg, 319), 101/ fW"(8) }, op = (ma), 
A, =A,(F) = {08 # 0, FLO — 4x, 6+ 4x]} = ôx for all mg < n < no}, 
Rg =R AF) = Ag (6||6| > Gog B) 1}, 
0* =is defined by y'(0*) = X,, 
9* =0*(0, B) = 0| + [Q ~ ple], 
n* =n*(6, B) = min{|@|~?/0~™, 714}. 
The following lemmas are stated in somewhat greater generality than needed 
for Theorem 1 so as to enable their use for the proof of Theorem 2. 
LEMMA 1. There exists $0 < B_1 < \infty$ such that if $B \ge B_1$ then
$F\{A_B^{\mathrm{complement}}\} < 5\delta(b - a)$ whenever $F \in S[a, b]$.


Proor. Suppose a < 0 < b. Suppose 0 + 0 € [a,b] and F{[0 — ix, 
0 + 4]} < 5« for some mg < n < ng. There exists 0 < B, < oo (independent of 




0, F) such that if B> B, then, for such 6, F{[@— że, 0 + $e}]} < 28e} 
whenever F € S[a, b]. Therefore 


(Ap) c {O\F{(O — Leg, 0 + 4e$)} < 2608} 


if B > B, 
Let b= b, and define recursively b, = max{9|0 < 6 < b,_, — 38b., 
F((@ — tež, 0 + 4e$)} < 28e§}, i= 1,2,..., and define 


D, = {6|b, — heh, < 0 < b, + deh}. 


Thus, (0, b] N (Ag)™™'=* C U, D, Clearly, F{D,} < 28e}, = 25|D. Also 
D, ^ D, = ¢ if |i -j| > 1, so F{U,D,} < FU;D,,} + FUD; 11} < 28b + 280% 
+28b < 58b. Consequently, F{(0, b] N (A jermen) < 56b. A similar argu- 
ment holds for [a,0), so that F{(Ap)™™P!™="} < 56(b — a). The argument 
for 0 < a < b and for a < b < 0 is analogous. O 


Lemma 2. Let 0 <n < œ and denote 
Ca, 8, P) = log E(n, P) | f”? y) a) 
8 —(1/2)% 
Then 


sup sup max Pal max| s(n +j,6, F) — &(n,6,F)|> n) >g. 
FES[a, b] 9EAp n>(1-p)lp \J2l 


Proor. Let W? denote the event {|6* — 0| < £}. There exists a constant 
c > 0 independent of n, 6, F such that if a < @ < b, then 
(5) (|X, - ¥'(9)| < ce} c Wy. 

Following the proof of Lemma 2 of Pollak and Siegmund (1975), for A > 6 

P,{X, — ¥'(8) > 2} 

ie = fo yoa PiL — A)%n = (HO) = WAN)]} dP, 
< exp{—n[(y/(8) +2)(A - 8) — (¥(A) — »(8))]} 

xP {X — ¥'(8) > z2}. 

Setting z = ce and A = 6 + n”/8-1/2 yields for large enough n 
(7) P,{X, — V'(0) > ce} < exp{ — ten} 
uniformly for a < @ < b. Hence 

sup 2 P,{X, — ¥'(8) > ce} > p0. 


as@<6bner 




A similar analysis yields 
sup >. P,{X, = y'(8) < —ce} Saah 
as@<ban=r 
Hence, by (5) 
(8) inf r A we) Faak 
as$<b one 
Note that 


log L(n, 6 + A@) 
= n|(0 + Ad)X, — ¥(0 + AG)] 
= n{0X, — (8) + AO[z, — w'(8)] — 4y’(9)(A8)"[1 + 0(1)]}, 


where o(1) > 0 as A@ —> 0 uniformly in 0 € [a — łe, b + }łe]if a — Łe, b + hee 
Q. For large enough n, for 0 = 8*, this becomes 


log L(n, 6* + A0) = n{0*X, — ¥(0*) — 4y"(0*)(AG)"[1 + o(1)]}. 


Therefore, there exists B, (independent of F) such that if B > Bọ and @ € åp, 
on W? for n > (1 — p)l,, if F € S[a, b] 


{eL(n, y) dF(y) 
feat (n, y) dF(y) 
Oa a, + [2-9)L(n, y) dF( y) 
fe tgL(n, y) dF(y) 
L(n, 0* + 0/4) + L(n, 0* — 04/4) 

Sesmin{ L(n, 6* — eg), L(n, 0* + €9)} 

2exp{n|6*X,, — ¥(8*) — 4W"(8*)(09/4) [1 + o(1)]]} 
egexp(n|6°X, — ¥(8*) — 4y"(8*)(e9)°[1 + 0(1)]]) 


de 2exp{ — 2 [ming <0 <,0(9)] of (1 Ki ply} 
be 


<it 





s1lt+ 





silt 





<1 


> Bol- 


This, together with (8), accounts for Lemma 2. O 


LEMMA 3. Let0 <1 < œ. Let 
K,(8) = iny” (0){ [Xn - WOA OY, 
I ¥,8,4) = dnw'(8){y -0 - [X, - OOF 
+n(y— 0)" (A)/6. 




Let {r,(6)}"_, be any random sequence such that 0-0 <A (0) <0 +o. 
Then 


(i) sup Pf _ max 


as6<b 


ye Rents 9) = K n(9)| = n} > naoh 


(ii) sup r| max max | Insy( 0, Ana (0)) 


a<6sb 6-asysO+o yml,..., pn 


—J,(¥,9,,(8))| = n) > n=l. 


Proor. Part (i) follows from Proposition 1 of Lai and Siegmund (1979), 
noting that the proof of this Proposition 1 can be carried through uniformly for 
a<6<b. 

As for part (ii), note that 


Jn( 50,4) =n(y— 0) y” (A)/6 + K,(8) + dW"(8)(y — 8)'n 


~¥[X,-v(8)|(y 9). 


tal 


Therefore, for 9-0 <y <8 +0,1 <j < pn“ and large enough n 
(Iah 0, Ana (0)) — ICY 8, An(9))| 
< pno? sip jy” (0) + wp eee: n 











(9) as6@<b 
n+j 
+|Kn (8) — K,(8)| + ¥’(8)pn%? +) X [X,- w'(8)] |o 
tan+1 
uniformly for a < 0 < b. By Kolmogorov’s inequality, 
n+J 
a max | © [X,-¥(8)] |o> vl 
J Uy pn? | pene] 





tont+l 


n+) 2 
z a Ps max | 2L [X- vo] : (v/a) 
< W"(8)pn%e?/(9/4)” > 4.400 
uniformly for a < @ < b. Part (ii) now follows from (9) and part (i). 0 
LEMMA 4. Using the notation of Lemma 3, denote 


Q,(9, F) a tog f Oep] Shee 6, A,(9))} dF(y). 




Let 0 < q < œ. Then 


sup sup max P| max |@n(0, F) - Q.(0, F)| > n] 
FES[a, b] 0EAp mgsnsng J™l,..., pri A(ng—n) 

> Bal. 
PROOF. 


Qn- (9, F) = @,(4, F) 
= 1og[ | r tenp = [Ins 9, 9, Ansj(O)) = al 9s 8, Aa(0))]) 
xexp{—d,(y¥,9,A,(8))} dF(¥)/ 


[exp II, 8, 04(0))} FO). 


This is the logarithm of an expectation of exp{—[J,4 (Y, 9, Ans (9) — 
Jy, 9, A,,(8))]}. Lemma 4 is therefore a consequence of Lemma 3(ii). 0 


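LEMMA 5.

\[ \log L(n, F) = \theta \sum_{i=1}^{n} X_i - n\psi(\theta) + \xi(n, \theta, F), \]

where $\{\xi(n, \theta, F)\}_{n=1}^{\infty}$ is a sequence of random variables which satisfies for any
$0 < \eta < \infty$

\[ \sup_{F \in S[a, b]} \; \sup_{\theta \in A_B} \; \max_{m_\theta \le n \le n_\theta} P_\theta\Big\{ \max_{j = 1, \ldots, pn^\alpha \wedge (n_\theta - n)} |\xi(n + j, \theta, F) - \xi(n, \theta, F)| > \eta \Big\} \xrightarrow[B \to \infty]{} 0. \]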


Proor. Using the notation of Lemma 3 and Lemma 4, for 6¢ A, and 
My < n < Ng there exists À „(0) € (0 — o, 8 + o) such that 


log f° '"L(n, y) dF(y) 
0— (1/2) 


= log f°” *exp(n[ 9X, — ¥(y)]} dF(y) 
0-1/2% 


= log [+ exp (n |X, — WO) + (7 = OF, ~ ¥(0)) 


—(1/2)ay 
—1(y — 6)?"(8) — (y — 0)’ y” (A (0))/6]} aF( y) 


=% D X, — ny(0) + K,(6) + Q,(0, F). 


i=] 


Lemma 5 now follows from Lemma 2, Lemma 3(i) and Lemma 4. 0 


The following two lemmas are needed in order to apply the nonlinear renewal 
theorem. 




LEMMA 6.

\[ \sup_{F \in S[a, b]} \; \sup_{\theta \in R_B} P_\theta\{T(B, F) > n_\theta\} \xrightarrow[B \to \infty]{} 0. \]

Proor. Denote w, = X, — Y'(8). 
L(n, F) = fexp(n[ yX,- ¥(y)]} dF(y) 


> [exp (nl ao) + wn) — ¥(y)]} dF(y) 


—(1/2)e 


= [exp (n [0 ¥'(8) +a) + (y OyO) o) 


=(1/2)e 
—[y(8) + (y= 8)¥'(8) + iy- yE ]]} dF), 
where |§ — 0| < te; so for 8 € Ap, n > ly and large enough B, 
(10) L(n, F) > exp{nI(4) }exp{ —n|a,|(|6| + łe) }exp{ -y"(0)n??} ôx. 
Denote z = {I(0)p/[&(\6| + 4e)]}(4,)"°-”. By (6) above, setting A = 
0 + 2/4"(0), 
P,{w,, > 2} < exp{—nol(y'(8) + z)(A — 8) — (4(A) -4(0))]} > 200 


uniformly in 9 € Ry, and similarly Pilon, < —-z} > Baol 
Therefore, inserting n, instead of n in *(10), it follows “that 


i L B 
TA ‘4 a P,{L(ng, F) > B} > gsal, 


which is equivalent to the statement of Lemma 6. 0 


Bow 


LEMMA 7.

\[ \sup_{F \in S[a, b]} \; \sup_{\theta \in R_B} P_\theta\{T(B, F) < m_\theta\} \xrightarrow[B \to \infty]{} 0. \]


Proor. Let T= T(B,F) and denote S, = nX,. Following the proof of 
Lemma 3 of Pollak and Siegmund (1975), for any y > 0, 


Po{T < mo} < PS, =E mg = y[¥"(8)me]'”) 
+ aS — 0 
(11) Gancio oe gi mel )) ei 
< Pif Sm, — ¥'(8)mg > y[y"(0)me]'”)} 


+exp{I(6)m, + 0y[y"(0)m,] 7} PLT < mo}. 
Since PT < m} < PT < œ} < 1/B [ef. Robbins (1970)], for y = 


(m)'/°/ J4” (0), the second term on the far right side of (11) is less than 
exp{ — (log By} when B is large enough, for all 6 € Ry. Also for large enough 


B, for the same value of y, for all 9 € Rp 
Pi Sn, — ¥'(8) img > y[¥"(8)mg]“”} 
< Pf Xn, — ¥'(8) > (m) 0A) 


Ss exp{ 7 4(m,y)'/°} = Bees 


where the last inequality follows from (7) above and the convergence is uniform 
in 8 € [a, b] — {0}. This completes the proof of Lemma 7. 0 


PROOF OF THEOREM 1. Let a, < b, < 0 < a, < b, be interior points of Q 
and denote T = [a,, b,] U [a;, b2]. Suppose first that the support of F is 
contained in T. By virtue of (4) and the definition of T 


Í E,exp{ — [log L(T, F) — log B] } dF(@)< BP {T < œ} 
G2)” 
s E,exp{ — [log L(T, F) — log B] } dF(0) + F(Aggrriment) , 


Letting =(A) denote the indicator function of the set A, define 


$(0) = =(9 € Aa). 





Eexp{ — [log L(T, F) — log BJ} — [exp(-z) dH, (x) 





Lemmas 5, 6, and 7 ensure that the considerations of the proof of Theorem 1 of 
Lai and Siegmund (1977) carry through for ĝ € A p, so that $(@) > 2, ,,0. Hence 
fro(8) dF(@) > g0, which implies 


J, Beexo( log LT, P) ~ tog B1) FCO) 


(13) 

af f l-2) ail) dF(0) Seg S 
Clearly 
ff e —x} dH,(x) dF(@) — f few ~x} dH, (x) aR(6)| 





< F{Aggnplement} 


Since 6 is arbitrary, (12), (13), and (14) in conjunction with Lemma 1 account 
for Theorem 1 for F whose support is contained in T. 

For general F, suppose first that — oo = inf{x|x € Q} and sup{x|x € R} = oo. 
Let y>0O and choose -œ <a, <b, <0<a,< 6, < œ such that F(T) > 
(1 — y) where T =[a,, b,] U [ay, b2]. Let B* = B/F{T} and let dF*(x) = 
dF(x)/F{T} for xeT; dF*(x)=0 otherwise. Clearly {T(B, F) < «} 2 




{T( B*, F*) < œ}. Since F* € S[a,, ba] 
lim inf BP;{T(B, F) < o} > limingBP,(T(B*, F*) < oo} 
= %0 =o 


> | [~exp(—x} dH,(x) dF(8)(1 — Y). 
J, exp(—x)} dHo(x) dF(8)(1  ) 
On the other hand, from (4) it follows that 

BP,{T(B, F) < œ} < fexp{—[log L(T, F) ~ log B] dF(8) + y 

T 

and so in a manner similar to the proof for F € S[a, b] it follows that 

lim sup BP,{T(B, F) < œ} < f f exp{ 1) dH;(x)dF(0) + y. 

Bow T“0 


Decreasing y towards zero—i.e., increasing 6, to zero and b, to oo, and 
decreasing a, to — œ and a, to 0—concludes the proof. 

If inf(x|x € Q) > —oo or sup{x|x € Q) < œ, a similar proof is valid. The 
details are omitted. 0 


3. Uniform convergence. We will continue to use the notation of the
previous sections. The distribution of $X$ is said to be strongly nonlattice [Stone
(1965)] if $\liminf_{|t| \to \infty} |E \exp\{itX\} - 1| > 0$.
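To make the definition concrete, the quantity $|E\exp\{itX\} - 1|$ can be evaluated from closed-form characteristic functions. The sketch below uses the standard characteristic functions of the normal, unit-rate exponential and Bernoulli distributions; the function names are hypothetical, and evaluating at finitely many $t$ only illustrates the limit inferior rather than proving anything about it.

```python
import cmath
import math


def cf_normal(t, mean=0.0, sd=1.0):
    """Characteristic function of N(mean, sd**2)."""
    return cmath.exp(1j * mean * t - 0.5 * (sd * t) ** 2)


def cf_exponential(t):
    """Characteristic function of the exponential distribution with rate 1."""
    return 1.0 / (1.0 - 1j * t)


def cf_bernoulli(t, p=0.5):
    """Characteristic function of a Bernoulli(p) variable."""
    return (1.0 - p) + p * cmath.exp(1j * t)


def distance_from_one(cf, t):
    """|E exp{itX} - 1| for the characteristic function cf."""
    return abs(cf(t) - 1.0)
```

For the normal and exponential cases the distance tends to 1 as $|t| \to \infty$, so the limit inferior is positive. For the Bernoulli case the distance vanishes along $t = 2\pi k$, so the limit inferior is 0, in line with the earlier remark that lattice families such as the binomial do not satisfy the requirement.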


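THEOREM 2. Suppose that the $P_y$-distribution of $X_1$ is strongly nonlattice for
all $y \in \Omega$. Let $a < b$ be interior points of $\Omega$. Then

\[ B P_0(T(B, F) < \infty) \xrightarrow[B \to \infty]{} \int_a^b \int_0^\infty \exp\{-x\} \, dH_\theta(x) \, dF(\theta) \]

uniformly in $F \in S[a, b]$.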


The proof breaks down into two parts: for $\theta \in R_B$, the proof is similar to that
of Theorem 1, with the strongly nonlattice property ensuring uniform convergence.
For $\theta \in A_B - R_B$, the proof shows that $\theta X_1 - \psi(\theta)$ is stochastically small
enough to ensure that any "overshoot" is negligible. The details are spelled out in
the following lemmas.


LEMMA 8. If $X$ is strongly nonlattice then so is $cX + d$ for any $c \ne 0$,
$-\infty < d < \infty$.


PROOF. It suffices to show that if $X$ is not strongly nonlattice, then neither is
$cX + d$. Without loss of generality, let $c = 1$.

Suppose there exists a sequence $\{t_j\}_{j=1}^{\infty}$, $|t_j| \to_{j \to \infty} \infty$ such that $E e^{it_j X} \to_{j \to \infty} 1$.
Then $e^{it_j X} \to_{j \to \infty} 1$ in probability. Therefore, for any integer $k$, $e^{ikt_j X} \to_{j \to \infty} 1$ in
probability, and this convergence is uniform for any finite set of integers
$k = 1, \ldots, m$.

Let $\eta > 0$. If $j$ is large enough, then $\max_{k = 1, \ldots, m} |E e^{ikt_j X} - 1| < \eta$. Clearly, $m$
can be chosen to be large enough so that $\sup_t \min_{1 \le k \le m} |e^{iktd} - 1| < \eta$. Thus,
for any large enough $j$, there exists $k_j \in \{1, \ldots, m\}$ such that

\[ |E e^{ik_j t_j (X + d)} - 1| = |(E e^{ik_j t_j X} - 1) + (e^{ik_j t_j d} - 1) + (E e^{ik_j t_j X} - 1)(e^{ik_j t_j d} - 1)| < 2\eta + \eta^2. \]

Letting $\eta \to 0$, it follows that $X + d$ is not strongly nonlattice. □


LEMMA 9. Let $Y_1, Y_2, \ldots$ be strongly nonlattice iid random variables with
$EY_1 \ge 0$, $P(Y_1 = 0) < 1$. Let

\[ V = \min\Big\{ n \,\Big|\, \sum_{i=1}^{n} Y_i > 0 \Big\}, \qquad Z = \sum_{i=1}^{V} Y_i. \]

Then $Z$ is strongly nonlattice.


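PROOF. The lemma is trivial if $P(Y_1 > 0) = 1$. Consider the case that
$P(Y_1 > 0) < 1$. One must show that $\liminf_{|t| \to \infty} |E e^{itZ} - 1| > 0$. Suppose this were
not the case, but that there exists a sequence $\{t_j\}_{j=1}^{\infty}$, $|t_j| \to_{j \to \infty} \infty$ such that
$E e^{it_j Z} \to_{j \to \infty} 1$. Then

\[ (15) \qquad e^{it_j Z} \xrightarrow[j \to \infty]{} 1 \]

in probability. Since $E e^{it_j Z} = E E(e^{it_j Z} \mid Y_1)$, it follows that on $\{Y_1 > 0\}$ (on which
$Z = Y_1$)

\[ (16) \qquad P(e^{it_j Y_1} \to_{j \to \infty} 1 \mid Y_1 > 0) = 1. \]

Let $u^* = \sup\{x \mid P(Y_1 < x) < 1\}$. Clearly, $u^* > 0$. Let $U_1 = (-u^*, 0]$. If
$u^* < \infty$, let $U_k = (-ku^*, -(k - 1)u^*]$, $k = 1, 2, \ldots$.

Suppose $Y_1 \in U_k$. Then $P(V = k + 1, Y_2 > 0, \ldots, Y_{k+1} > 0 \mid Y_1) > 0$. On this
event

\[ (17) \qquad e^{it_j Z} = e^{it_j Y_1} \prod_{m=2}^{k+1} e^{it_j Y_m}. \]

From (16) it follows that

\[ P\Big( \prod_{m=2}^{k+1} e^{it_j Y_m} \to_{j \to \infty} 1 \,\Big|\, Y_1, V = k + 1, Y_2 > 0, \ldots, Y_{k+1} > 0 \Big) = 1 \]

a.s. on $Y_1 \in U_k$. From (15) and (17) it therefore follows that

\[ P(e^{it_j Y_1} \to_{j \to \infty} 1 \mid Y_1 \in U_k) = 1. \]

Therefore $P(e^{it_j Y_1} \to_{j \to \infty} 1) = 1$ and so $E e^{it_j Y_1} \to_{j \to \infty} 1$, contradicting the
assumption that $Y_1$ is strongly nonlattice. □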


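LEMMA 10. Let $S_n^\theta = \theta \sum_{i=1}^{n} X_i - n\psi(\theta)$, let $\tau = \tau(\theta, A) = \min\{n \mid S_n^\theta > A\}$
and let $H_\theta$ be the $P_\theta$-limiting distribution of $S_\tau^\theta - A$ as $A \to \infty$. Then the
convergence in $P_\theta$-distribution of $S_\tau^\theta - A$ is uniform in $\theta \in [a, b] - \{0\}$.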




Y, = 6X,— (8), 
V, = 0, Vj = minnn > Vor. A > 0}, | ea ee 


t= V,_,4+1 


and 


By Lemmas 8 and 9, the P,-distribution of Z, is strongly nonlattice. 

Let G denote the renewal function defined by G,(x) = L%_,G)"(x) where 
G{” is the distribution of £7_,Z,. By Theorem (ii) of Stone (1965), there exists 
r > 0 (r = r(8)) such that 


(18) Go(x) = x/p; + #a/(2p3) + Yo(x), 
where p, = E,Z,, Ka = E,Z? and 
(19) Yo(x)exp{rx} >, ...0. 


Let 0 < y. A check of Stone’s (1965) proof reveals that there exists a constant 
r > 0 independent of 6 € [a, b] — [—y, y] such that the convergence in (19) is 
uniform in @ € [a, b] — [—y, y]. Now 


P(S? — A > x) 


œ n-1 n 
Èn EZ<A LZ>A+x 





(20) = 5 J I- GPKA + x s) Gds) 
= fu — GPA + x — s)]G,(ds). 
Note that 
fp - GP(A + x - s)| ds Haaa — GP(x + 8)] ds 
(21) 


= f I- Gp%s)] as 
uniformly for x > 0, 8 € [a, b] — [-y, y]. Also, letting Y, be as in (19) 
f IL- Gp(A + x- s)]Yp(ds) 
0 


=[1- GP(A + x- s)}Y,(s) |$ + f RaP + x — ds) 


(22) = [1 - Gf(x)]¥,(A) - [1 - GP(A + x)]¥4(0) 


- J Y (A = s)GP(x + ds) 
0 
gi A> m0 
uniformly for x > 0, 8 € [a, b] — [—y, y]. Combining (18) and (20) together with 




(21) and (22) proves that the convergence of the P,-distribution of S? — A to He 
is uniform in 8 € [a, b] — [—y, Y]. 

Now suppose 6 € [—y, Y] — {0}. By Theorem 3 of Lorden (1970), for any 
A 20, 

E,(S?— A)’ < SE{[0X, — ¥(0)]* }*/1(@) < 204E, X4/I(8) > 4.0. 
Hence, S? — A >, _,)0 in P,probability, uniformly in A > 0, and thus also 
Hy — o o 5,o)- Therefore, if y > 0, there exists y > 0 such that the Lévy distance 
between H, and the P,-distribution of S? — A is bounded by n, uniformly for 
A 20, 6€[-y,y] — {0}. By the above, there exists A, such that if A 2 A, 
the Lévy distance between H, and the P,-distribution of S? — A is bounded by n 
uniformly for 6 € [a, b] — {0}. Since ņ is arbitrary, the proof of Lemma 10 is 
complete. O 


LEMMA 11.

\[ \sup_{F \in S[a, b]} \; \sup_{\theta \in [a, b] - \{0\}} P_\theta\{T(B, F) \le (1 - p) l_\theta\} \xrightarrow[B \to \infty]{} 0. \]


Proor. The proof is analogous to the proof of Lemma 7, replacing m, 
by (l —p)l and y by p[J(@)log B]'/7/[2\0K~"(6))'7]. The second term 
on the rigat side of (11) is less than exp{— 4p log B}. There exists a con- 
stant c, > 0 such that the first term on the right side of (11) is less than 
Pi{ Xa -p — ¥'(9) > ¢,|6]}. By virtue of (6), there exists a constant c, > 0 such 
that this probability is uniformly less than 

exp{ — (1 — p)lgc07} = exp{—c,(1 — p)[0?/1(8)]log B} > 5,00 
uniformly for 0 € [a, b] — {0}.0 


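LEMMA 12. Let

\[ n^* = \left(\frac{1}{|\theta|}\right)^{2/(1 - 3\beta)} \vee (7 l_\theta). \]

Then

\[ \sup_{F \in S[a, b]} \; \sup_{\theta \in A_B - R_B} P_\theta\{T(B, F) > n^*\} \xrightarrow[B \to \infty]{} 0. \]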


Proor. Let w, =X, — ¥’(0). For large enough B, for 6 € Ap — Rpg 


O+(1/2)eF i 
L(n, F)= f Vc. "aply È x,- n(o) dF(y) 
O-(1/2)e§ tm] 


= nl) fo exo (nl y- 9)y(8) 
—[¥(y) = ¥(8)] + yo,]) dF(y) 


n(e} j e 7llonin get 


v'(8) 
4 





> or Pex| ~ 


> e" K-80 /A)p— Alwin ôež. 


Since |6| > (n*) 7073812 it follows that n*8? > (n*)°*; and, since also lọ < n*/7, 
it follows that ež > (n*)*-!/”. Therefore, for large enough B (for 0 € Az — Rp) 
L(n*, F) = dexp(n*[ 16? — 210] jonl — [4 — B] (log n*)/n*]) 
> dexp(6?n*[1 — 20,116] — [$ — B] (log n*)/(n*)**]). 


Following the notation and proof of Lemma 2, if |X,. ~ '(8)| < e(n*)P-17?, 
then 


2|a,-|/18| < 2e(n*)P- 0/9 + -38/2 2e(n*) 4? 
so that (for large enough B) 
L(n*, F) > exp{0?n*/6} > B. 
Lemma 12 now follows from the fact that by (7) 
ue Pi{|Xns — y'(0)| > e(n*)P 771 > 0.0 


bEApg— Rp 


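LEMMA 13. Let $n^*$ be as in Lemma 12. Denote $\theta^* = |\theta| + [(1 - p) l_\theta]^{\beta - 1/2}$.
Let $x > 0$. Then

\[ \sup_{F \in S[a, b]} \; \sup_{\theta \in A_B - R_B} P_\theta\Big\{ \max_{i = 1, \ldots, n^*} \theta^* |X_i| > x \Big\} \xrightarrow[B \to \infty]{} 0. \]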


Proor. Let (27) > 0 be in the interior of Q. For all 8 € Ap — Ry, when B is 
large enough 


P max 6*X, > x) =1- [1 - B{0*X, > x}]” 


=1- [1 — P,{exp{X,} > exp{nx/0*}]” 

<1- [1 - E,exp{nX,}/exp{nx/0*}]” 

<1- [1 — exp{y(2n)} /exp{nx/0*}]™ 
0. 


A similar analysis for P,{min,., | ,.0*X,< —x} completes the proof of 
Lemma 13. O 


D B+0 


PROOF OF THEOREM 2. By virtue of (1) and the definition of T 


4 E,exp{ — [log L(T, F) — log B] } dF(0) < BP,{T < œ} 


(23) 
< f Esexp{—[log L(T, F) - log B]} dF(0) + FLAG). 


Lemmas 5, 6, 7, and 10 ensure that the considerations of the proof of Theorem 1 
of Lai and Siegmund (1977) carry through uniformly for 8 € Rp, F € S[a, b], so 


that 


sup J, Boexe( - [log L(T, F) — log B]} dF(6) 


FeS[a, b] 





(24) 
- J, [exo( x) ati) ar(0)| » 


Let y > 0. By (24) there exists B, such that if B > B, 


sup 
FeS[a, b]|°R 





E,exp{ — [log L(T, F) — log B]} dF (6) 
(25) 





=f f “exp{ —x) dH,(x) dF(0) 


B 


< 7/4. 


CasE I. If F{A,— Rp} < y/8, then it clearly follows from (25) that if 
B>B, 
i E,exp{ — [log L(T, F) — log B] } dF(@) 
B 
(26) 
< y/2. 





- f f, exp —a) atto(x) aC) 


Case II. If F(A, — Rp} > 7/8, let 0* be as in Lemma 13. By virtue of 
Lemma 11 and Lemma 2, if ņ > 0, then 





sup sup Pa(|log LCT, F) = 10g f** 2-207, y) ar(»)| > n) 


FES[a,b]0EAp- Rp 
0. 


> Ba 
Clearly 

g- 

pg, B(T, y) dF(y) exp (0*1 Xr} B. 


Lemmas 12 and 13 therefore imply that log L(T, F) — log B > g0 in 
P,-probability uniformly for 0 E€ A, — R g. Since Hy > g o 80) (see the proof of 
Lemma 10), it follows that there exists B, such that if B > Dae 





J rs - [log L(T, F) — log B] } dF(@) 


âg 


aie [exp —x)} a(x) dF(8) < 7/4, 


Bo 




so that by (25) if B > max(B,, B,), then 


i E,exp{ — [log L(T, F) — log B] } dF(#) 





(27) 
E [ ela} dH, dF(0)| < /2. 
Ap’d 
Clearly 
(28) j [esp -2) dtte(x) dF(0) 


b ro 
- ff exp{-x)} dH,(x) dF(0) 
a "0 
By virtue of Lemma 1, ô can be chosen so that 
(29) Fi Aparna) < y/4, 


Now (23), (25), (26)/(27), (28), and (29) imply that there exists B, such that if 
B > B, then 





< F{agretenent) 


pele, [BAIT < oo} — S [rex{-2) dH, (x) dF(@)| < y. 


Since y is arbitrary, this completes the proof of Theorem 2. O 


4. Remarks. The set $\{\theta \mid \theta X_1 - \psi(\theta)$ has a lattice $P_\theta$-distribution$\}$ is at most
countable [Woodroofe (1982), Section 6.2], so that the restriction in Theorem 1 is
not prohibitive. If $F$ gives positive probability to a value $\theta$ for which $\theta X_1 - \psi(\theta)$
has a lattice $P_\theta$-distribution, then it can be shown that

\[ \log L(n, F) = \theta \sum_{i=1}^{n} X_i - n\psi(\theta) + \log F\{\theta\} + o_p(1), \]

where $o_p(1) \to_{n \to \infty} 0$ in probability. Therefore, Theorem 1 will not hold as is; the
lattice property is not asymptotically negligible.

The strongly nonlattice property is utilized to attain uniformity in the 
convergence of the distribution of S? — A (Lemma 10). If other means can be 
found to yield uniformity—such as regarding probability measures F whose 
support is a finite set of points—then the requirement of the strongly nonlattice 
property may be relaxed. 


Acknowledgments. The author would like to thank the referees and the 
Associate Editor, whose comments brought about more natural proofs than those 
supplied in the first version. 




REFERENCES 


FELLER, W. (1971). An Introduction to Probability Theory and Its Applications 2. Wiley, New York. 

LAI, T. L. and SIEGMUND, D. (1977). A non-linear renewal theory with applications to sequential
analysis I. Ann. Statist. 5 946-954.

LAI, T. L. and SIEGMUND, D. (1979). A non-linear renewal theory with applications to sequential
analysis II. Ann. Statist. 7 60-76.

LORDEN, G. (1970). On excess over the boundary. Ann. Math. Statist. 41 520-527.

POLLAK, M. (1978). Optimality and almost optimality of mixture stopping rules. Ann. Statist. 6
910-916.

POLLAK, M. (1983). Average run lengths of an optimal method of detecting a change in distribution.
Technical Report, Dept. of Statistics, Stanford Univ.

POLLAK, M. and SIEGMUND, D. (1975). Approximations to the expected sample size of certain
sequential tests. Ann. Statist. 3 1267-1282.

POLLAK, M. and SIEGMUND, D. (1985). A diffusion process and its application to detecting a change
in the drift of Brownian motion. Biometrika 72 267-280.

ROBBINS, H. (1970). Statistical methods related to the law of the iterated logarithm. Ann. Math.
Statist. 41 1397-1410.

STONE, C. (1965). Moment generating functions and renewal theory. Ann. Math. Statist. 36
1298-1301.

WOODROOFE, M. (1976). A renewal theorem for curved boundaries and moments of first passage
times. Ann. Probab. 4 67-80.

WOODROOFE, M. (1982). Nonlinear Renewal Theory in Sequential Analysis. SIAM, Philadelphia.


DEPARTMENT OF STATISTICS
HEBREW UNIVERSITY
91905 JERUSALEM
ISRAEL


The Annals of Statistics 
1986, Vol. 14, No. 3, 1030-1048


THE SHAPE OF BAYES TESTS OF POWER ONE¹


By HANS RUDOLF LERCHE
University of Heidelberg


The problem of determining Bayes tests of power one (without an 
indifference zone) is considered for Brownian motion with unknown drift. 
When we let the unit sampling cost depend on the underlying parameter in a 
natural way, it turns out that a simple Bayes rule is approximately optimal. 
Such a rule stops sampling when the posterior probability of the hypothesis is 
too small. 


1. Introduction. Let $W(t)$ denote Brownian motion with unknown drift
$\theta \in \mathbb{R}$ and $P_\theta$ the associated measure. We consider the following sequential
decision problem. Let $F$ be a prior on $\mathbb{R}$ given by $F = \gamma\delta_0 + (1-\gamma)\int \varphi(\sqrt{r}\,\theta)\sqrt{r}\,d\theta$
with $0 < \gamma < 1$ and $\varphi(x) = (1/\sqrt{2\pi})\,e^{-x^2/2}$, consisting of a point mass at $\{\theta = 0\}$
and a smooth normal part on $\{\theta \ne 0\}$. Let the sampling cost be $c\theta^2$, with $c > 0$,
for the observation of $W$ per unit time when the underlying measure is $P_\theta$. We
assume also a loss function which is equal to 1 if $\theta = 0$ and we decide in favor of
"$\theta \ne 0$" and which is identically 0 if $\theta \ne 0$. A statistical test consists of a
stopping time $T$ of Brownian motion where stopping means a decision in favor of
"$\theta \ne 0$."

The Bayes risk for this problem is then given by


(1.1) $\rho(T) = \gamma P_0(T < \infty) + (1-\gamma)\,c \int_{-\infty}^{\infty} \theta^2 E_\theta T\,\varphi(\sqrt{r}\,\theta)\sqrt{r}\,d\theta.$


In this paper we investigate the "optimal" stopping rule $T_c^*$ which minimizes
$\rho(T)$.

For the cost $c$ sufficiently small, $T_c^*$ is a test of power one for the decision
problem $H_0: \theta = 0$ versus $H_1: \theta \ne 0$. This is by definition a stopping time $T$
which satisfies the conditions


(1.2) $P_0(T < \infty) < 1,$
(1.3) $P_\theta(T < \infty) = 1$ if $\theta \ne 0$.


Here stopping also means a decision in favor of "$\theta \ne 0$." For a discussion of tests
of power one see Robbins (1970). A similar problem has been studied by Pollak
(1978) who assumed an indifference zone in the parameter space. The type of
prior assumed here was once proposed by Jeffreys (1948).

A basic idea of this paper is to let the sampling cost depend on the underlying 
parameter in a natural way. At the first view the cost term “c0?” has an unusual 


Received May 1984; revised October 1985.

¹This work was supported by the Deutsche Forschungsgemeinschaft and the National Science
Foundation at MSRI, Berkeley, under NSF Grant MCS81-20790.

AMS 1980 subject classifications. Primary 62L15; secondary 62C10.

Key words and phrases. Sequential Bayes tests for composite hypotheses, tests of power one,
simple Bayes rules.




structure. The factor $\theta^2/2$ is the Kullback-Leibler information number
$E_\theta \log(dP_{\theta,1}/dP_{0,1})$ which quantifies the separability of the measures $P_\theta$ and $P_0$.
Its meaning becomes apparent by the following consideration. Let us consider
two testing problems with simple hypotheses:


(1) $H_0: \theta = 0$ versus $H_1: \theta = \theta_1$,
(2) $H_0: \theta = 0$ versus $H_2: \theta = \theta_2$,


with $\theta_i > 0$, $i = 1, 2$. Let $t_i$, $i = 1, 2$, denote the sampling lengths. Then the level-$\alpha$
Neyman-Pearson tests for both problems have the same error probabilities if and
only if $\theta_1^2 t_1 = \theta_2^2 t_2$. [This follows from the power function of a Neyman-Pearson
test of level $\alpha$: $\Phi(-c_\alpha + \theta\sqrt{t})$.] Thus the factor $\theta^2$ standardizes the sampling
lengths in such a way that the embedded simple testing problems are of equal
difficulty. Beside this statistical aspect there is a basic mathematical reason
for this choice of the sampling costs. Since in our decision problem (1.1) an
indifference zone does not occur and since $E_0 T = \infty$ [as $P_0(T < \infty) < 1$] we
have $\lim_{\theta \to 0} E_\theta T = \infty$. More information about the singularity is provided
by a lemma of Darling and Robbins (1967) [see also Robbins and Siegmund
(1973) and Wald (1947), page 197]. It states that for every stopping time $T$ with
$P_0(T < \infty) < 1$


(1.4) $E_\theta T \ge 2b/\theta^2$, where $b = -\log P_0(T < \infty)$.

Equality in (1.4) holds for the special stopping rule


(1.5) $T = \inf\{t > 0 \mid dP_{\theta,t}/dP_{0,t} \ge e^{b}\}.$


Here $dP_{\theta,t}/dP_{0,t}$ denotes the likelihood ratio (Radon-Nikodym derivative) of $P_\theta$
with respect to $P_0$ given the path $W(u)$, $0 \le u \le t$. It is given by


$dP_{\theta,t}/dP_{0,t} = \exp(\theta W(t) - \tfrac{1}{2}\theta^2 t).$


According to (1.4) the expected sample size $E_\theta T$ of a test of power one,
considered as a function of $\theta$, has a pole at $\theta = 0$. The choice of "$c$" or "$c|\theta|$"
instead of "$c\theta^2$" would imply that tests of power one have an infinite Bayes risk
since


$\int |\theta|^i E_\theta T\,\varphi(\sqrt{r}\,\theta)\sqrt{r}\,d\theta = \infty$ for $i = 0, 1$.


A precise description of the pole of $E_\theta T$ is given by Robbins and Siegmund (1973)
and Jennen and Lerche (1982). The sampling costs "$c\theta^2$" remove the nonintegrability
of the singularity of $E_\theta T$ for a large class of tests of power one, although
$\lim_{\theta \to 0} \theta^2 E_\theta T = \infty$ still holds [by the corollary on page 102 of Robbins and
Siegmund (1973)]. For instance for all tests of power one defined by


$T = \inf\{t > 0 \mid |W(t)| \ge \psi(t)\},$

where the function $\psi(t)$ is concave and $\psi(t) = o(t^{2/3 - \varepsilon})$ when $t \to \infty$ (with $\varepsilon > 0$
arbitrarily small), the Bayes risk (1.1) is finite. This follows from the inequality
$|\theta| E_\theta T \le \psi(E_\theta T)$, which is a consequence of Wald's lemma and Jensen's inequality.
Therefore by the choice of the sampling costs as "$c\theta^2$" the concept of
Bayes tests of power one becomes an interesting topic to study.

The related problem for simple hypotheses can be solved easily. The Bayes
risk given by


(1.6) $\rho(T) = \gamma P_0(T < \infty) + (1-\gamma)\,c\,\theta^2 E_\theta T,$

using statement (1.4), is minimized by the stopping rule

(1.7) $T_c^* = \inf\{t > 0 \mid W(t) \ge (\log a)/\theta + \tfrac{1}{2}\theta t\}$


with $a = \gamma(2(1-\gamma)c)^{-1}$ provided $a > 1$. In this case the minimal Bayes risk is
given by

(1.8) $\rho(T_c^*) = 2(1-\gamma)\,c\,[\log a + 1].$

When $a \le 1$, $T_c^* \equiv 0$, and $\rho(T_c^*) = \gamma$. (For more details see the end of the proof
of Theorem 2.) Here the choice of the sampling costs leads to a solution not
depending on $\theta$. This becomes obvious when one expresses $T_c^*$ in another way. It
can be rewritten as


$T_c^* = \inf\{t > 0 \mid \gamma(W(t), t) \le 2c/(1 + 2c)\},$


where


$\gamma(x, t) = \dfrac{\gamma}{\gamma + (1-\gamma)\,(dP_{\theta,t}/dP_{0,t})(x)}$


denotes the posterior mass of the parameter "0" at $(x, t)$ with respect to the
prior $F = \gamma\delta_0 + (1-\gamma)\delta_\theta$. Thus $T_c^*$ has the intuitive meaning "stop when the
posterior mass of the hypothesis '0' is too small". This is a simple Bayes rule or
equivalently the one-sided sequential probability ratio test (1.5).
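
The solution of the simple-hypothesis problem can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, and the risk of a one-sided test that stops with probability $p$ under $\theta = 0$ is taken to be $\gamma p + 2(1-\gamma)c\log(1/p)$, which assumes equality in (1.4).

```python
import math

def simple_bayes_rule(gamma: float, c: float):
    """Return (a, posterior_threshold, minimal_risk) for the risk (1.6).

    a is the likelihood-ratio level of (1.7); minimal_risk is (1.8) when a > 1
    and gamma (immediate stopping) otherwise. Names are illustrative only.
    """
    a = gamma / (2.0 * (1.0 - gamma) * c)
    threshold = 2.0 * c / (1.0 + 2.0 * c)  # stop once the posterior mass of theta = 0 falls to this level
    if a <= 1.0:
        return a, threshold, gamma
    risk = 2.0 * (1.0 - gamma) * c * (math.log(a) + 1.0)
    return a, threshold, risk

def risk_of_level(gamma: float, c: float, p: float) -> float:
    """Risk of the one-sided test (1.5) with stopping probability p under theta = 0."""
    return gamma * p + 2.0 * (1.0 - gamma) * c * math.log(1.0 / p)

def posterior_mass(gamma: float, likelihood_ratio: float) -> float:
    """Posterior mass of theta = 0 under the two-point prior, given the likelihood ratio."""
    return gamma / (gamma + (1.0 - gamma) * likelihood_ratio)

a, threshold, risk = simple_bayes_rule(0.5, 0.01)
# Crude grid search over p as an independent check of (1.8).
grid_min = min(risk_of_level(0.5, 0.01, k / 10000.0) for k in range(1, 10001))
```

With $\gamma = 0.5$ and $c = 0.01$ the level is $a = 50$, and the posterior mass at likelihood ratio $a$ equals $2c/(1+2c)$, which is the threshold form of the rule.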

The following study shows that a simple Bayes rule which stops when the
posterior probability of the hypothesis "$\theta = 0$" is too small, is approximately
optimal for the risk (1.1). For precise statements see Theorems 2 and 3 and
the corollaries. This simple Bayes rule is of the type (1.5) with a boundary equal
to


$\psi_c(t) = \left[(t + r)\left(\log\dfrac{t + r}{r} + 2\log b\right)\right]^{1/2}$, where $b = \dfrac{\gamma}{2(1-\gamma)c}$.


For large $t$ this boundary asymptotically grows like $(t \log t)^{1/2}$, which is faster
than the limiting growth rate $(2t \log\log t)^{1/2}$ of the law of the iterated logarithm.
As a consequence of our results the minimal Bayes risk can be approximated by
that of simple Bayes rules within $o(c)$ when $c \to 0$ (Theorem 4). Simple Bayes
rules for the same type of prior which we use (Jeffreys' priors) were already
discussed by Cornfield (1966).

Similar results hold for exponential families with general priors, although one
has to make a careful analysis of the overshoot effect following the ideas of
Lorden (1977) to derive an $o(c)$-approximation for the minimal Bayes risk. The
results will be published elsewhere. The proofs for the case of exponential families
are more technical, since special approximation arguments are needed. The nice
feature of the Brownian motion case is that most expressions can be calculated
exactly. Thus no approximations are needed and the proofs become simple.

This paper is organized as follows: Theorem 1 states the existence of an
optimal (Bayes) stopping rule $T_c^*$. Theorem 2 gives upper and lower bounds for
$T_c^*$, which makes it possible to derive its asymptotic shape when $c \to 0$ or $t \to \infty$.
Theorem 3 refines these bounds, which yields the above mentioned $o(c)$-approximation
of the minimal Bayes risk. Theorem 5 treats the one-sided case.

The results have some meaning for sequential clinical trials. These aspects are
discussed in more detail in a subsequent paper. Historical facts are mentioned in
Lerche (1985). A further result connected with the costs $c\theta^2$ is the exact Bayes
property of the repeated significance test. For that see Lerche (1985, 1986).


2. Preliminaries. We need the following notations. The Brownian motion
$W$ with drift $\theta$ starting at time $t$ in point $x$ is understood as a measure $P_\theta^{(x,t)}$ on
the space $C[t, \infty)$ of continuous functions on $[t, \infty)$. $\mathcal{F}_s^t$ denotes the $\sigma$-algebra on
$C[t, \infty)$ which is generated by $W(u)$, $t \le u \le s$. The restriction of the measure
$P_\theta^{(x,t)}$ on $\mathcal{F}_s^t$ is denoted by $P_{\theta,s}^{(x,t)}$. This notation is also used for stopping times $S$
instead of fixed times $s$. When the process starts at 0 at time 0, then we very
often skip the superindex and write just $\mathcal{F}_s$, $P_{\theta,s}$, etc. The Borel $\sigma$-algebra on the
parameter space $\mathbb{R}$ is denoted by $\mathcal{B}$. For $F = \gamma\delta_0 + (1-\gamma)\int \sqrt{r}\,\varphi(\sqrt{r}\,\theta)\,d\theta$, let
$dP = dP_\theta\,F(d\theta)$ and $d\bar P = d(\int P_\theta\,F(d\theta))$ be its projection. Let $F_{x,t}$ denote the
posterior distribution given that the process $W(t) = x$. This means that for
$A \times B \in \mathcal{F}_t \otimes \mathcal{B}$, $\int_A F_{W(t),t}(B)\,\bar P(dW) = P(A \times B)$ holds. Thus the Bayes risk
(1.1) can be rewritten as


(2.1) $\rho(T) = \int \left( F_{W(T),T}(\{0\}) + cT \int_{-\infty}^{\infty} \theta^2 F_{W(T),T}(d\theta) \right) d\bar P.$


Let $P^{(x,t)}$ denote the conditional distribution of the process under $\bar P$ given
$W(t) = x$. It can be represented as $P^{(x,t)} = \int P_\theta^{(x,t)} F_{x,t}(d\theta)$.

We define the posterior risk at the space-time point $(x, t)$ for a stopping rule
$T \ge t$ as

(2.2) $\rho(x, t, T) = \int \left( F_{W(T),T}(\{0\}) + c(T - t) \int \theta^2 F_{W(T),T}(d\theta) \right) dP^{(x,t)}.$

The minimal posterior risk at $(x, t)$ is defined as

(2.3) $\rho(x, t) = \inf_T \rho(x, t, T),$


where the infimum is taken over all stopping times of the process $(W(s), s)$
starting at $(x, t)$, including $T_t \equiv t$. For $T_t$ the risk is given by

(2.4) $\gamma(x, t) = \rho(x, t, T_t) = F_{x,t}(\{0\})$

and therefore the inequality $\rho(x, t) \le \gamma(x, t)$ holds. The quantity $\rho(x, t, T) +
ct\int \theta^2 F_{x,t}(d\theta)$ represents the loss when the process runs without stopping up to
$(x, t)$ and is stopped at $T \ge t$.

The following theorem states that an optimal (Bayes) stopping rule exists
which minimizes (2.1) and characterizes it. Let $\mathcal{C}^*(c) = \{(y, s) \mid \rho(y, s) < \gamma(y, s)\}$
and

(2.5) $T_c^* = \inf\{s \mid (W(s), s) \notin \mathcal{C}^*(c)\}.$


THEOREM 1. The stopping rule $T_c^*$ ($\ge t$) of the space-time process $(W(t), t)$
minimizes the risk (2.2) for all starting points $(x, t)$.


This type of result is well known. Its statement is usually called the principle
of dynamic programming. The result follows from the theory of optimal stopping
for Markov processes [cf. Shiryayev (1978), page 127] applied to the space-time
process $(W(t), t)$. We note that $W(t)$ under the measure $P$ is a diffusion
process which satisfies the stochastic differential equation $dW(t) =
(1 - \gamma(W(t), t))\,W(t)/(t + r)\,dt + dX(t)$ where $X(t)$ is a standard Brownian
motion [cf. Liptser and Shiryayev (1977), page 258].

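The stopping risk can be calculated by (2.4) as


(2.6) $\gamma(x, t) = \dfrac{\gamma}{\gamma + (1-\gamma)\int (dP_{\theta,t}/dP_{0,t})(x)\,\varphi(\sqrt{r}\,\theta)\sqrt{r}\,d\theta}$

with

(2.7) $\int \dfrac{dP_{\theta,t}}{dP_{0,t}}(x)\,\varphi(\sqrt{r}\,\theta)\sqrt{r}\,d\theta = \int \exp(\theta x - \tfrac{1}{2}\theta^2 t)\,\varphi(\sqrt{r}\,\theta)\sqrt{r}\,d\theta = \sqrt{\dfrac{r}{t + r}}\,\exp\left(\dfrac{x^2}{2(t + r)}\right).$


We note that on $\{\theta \ne 0\}$

(2.8) $F_{x,t}(d\theta) = (1 - \gamma(x, t))\,G_{x,t}(d\theta),$

where

$G_{x,t} = N\left(\dfrac{x}{t + r}, \dfrac{1}{t + r}\right)$

holds.


Here $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$.
The exact calculation of the minimal posterior risk $\rho(x, t)$ seems to be impossible
for this problem. We can only derive upper and lower bounds for it. To get those
we will rewrite the posterior risk in an appropriate form.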
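
The closed form in (2.7) can be verified against direct numerical integration. The sketch below is illustrative only; the function names, the midpoint rule and the integration limits are assumptions chosen for this example. It also computes the level set $\{\gamma(x, t) = \lambda\}$, which is how a boundary of the form $\psi_c(t)$ arises from a posterior threshold.

```python
import math

def phi(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mixture_lr_closed(x: float, t: float, r: float) -> float:
    """Right-hand side of (2.7)."""
    return math.sqrt(r / (t + r)) * math.exp(x * x / (2.0 * (t + r)))

def mixture_lr_numeric(x, t, r, lo=-12.0, hi=12.0, n=48000):
    """Midpoint-rule approximation of the middle integral in (2.7)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        th = lo + (k + 0.5) * h
        total += math.exp(th * x - 0.5 * th * th * t) * phi(math.sqrt(r) * th) * math.sqrt(r)
    return total * h

def posterior_mass(x, t, gamma, r):
    """Posterior mass of theta = 0 at (x, t), formula (2.6)."""
    return gamma / (gamma + (1.0 - gamma) * mixture_lr_closed(x, t, r))

def boundary(t, gamma, r, lam):
    """Positive x with posterior_mass(x, t) == lam, or 0.0 if that level is already reached at x = 0."""
    level = gamma * (1.0 - lam) / ((1.0 - gamma) * lam)
    arg = math.log((t + r) / r) + 2.0 * math.log(level)
    return math.sqrt((t + r) * arg) if arg > 0.0 else 0.0

closed = mixture_lr_closed(1.5, 2.0, 1.0)
numeric = mixture_lr_numeric(1.5, 2.0, 1.0)
lam = 2.0 * 0.01 / (1.0 + 2.0 * 0.01)      # threshold 2c/(1+2c) with c = 0.01
x_star = boundary(2.0, 0.5, 1.0, lam)
```

For $\lambda = 2c/(1+2c)$ the quantity `level` equals $\gamma/(2(1-\gamma)c)$, so `boundary` reproduces $[(t+r)(\log((t+r)/r) + 2\log b)]^{1/2}$.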


LEMMA 1.

(2.9) $\rho(x, t, T) = \gamma(x, t)\,P_0^{(x,t)}(t \le T < \infty) + (1 - \gamma(x, t))\,c \int \theta^2 E_\theta^{(x,t)}(T - t)\,G_{x,t}(d\theta).$


The posterior risk has the same form as the Bayes risk (1.1), with the slight
difference that the process starts in the space-time point $(x, t)$, stops at $T \ge t$,
and has as prior $F_{x,t} = \gamma(x, t)\delta_0 + (1 - \gamma(x, t))\,G_{x,t}$, the posterior at the point
$(x, t)$.

The proof of the lemma is a direct consequence of the preceding definitions
and of the following basic fact about posterior distributions: the posterior of
Brownian motion starting at $(x, t)$ with prior $F_{x,t}$ at point $(W(S), S)$ is given by
$F_{W(S),S}$.


3. Results for two-sided tests. The continuation region $\mathcal{C}^*(c)$ of the optimal
stopping rule for the Bayes risk (1.1) is now approximated by upper and
lower bounding regions of the space-time plane. These bounds are refined
in Theorem 3. The bounding regions are given by sets of the type
$\mathcal{C}(\lambda) = \{(x, t) \mid \gamma(x, t) > \lambda\}$.


THEOREM 2. There exists a constant $M > 2$ such that for every $c > 0$


(3.1) $\mathcal{C}\left(\dfrac{Mc}{1 + Mc}\right) \subset \mathcal{C}^*(c) \subset \mathcal{C}\left(\dfrac{2c}{1 + 2c}\right)$


holds.


REMARK 1. Let $T_\lambda = \inf\{t > 0 \mid (W(t), t) \notin \mathcal{C}(\lambda)\}$. Then (3.1) translates to
$T_{Mc/(1+Mc)} \le T_c^* \le T_{2c/(1+2c)}$.


REMARK 2. The theorem holds also for the more general prior
$\sqrt{r}\,\varphi(\sqrt{r}\,(\theta - \mu))\,d\theta$


by exactly the same arguments.
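
To make the ordering in Remark 1 concrete, the stopping times $T_\lambda$ can be simulated on a discretized path. This is only a rough sketch: the Euler time step, the horizon `t_max`, the seed and the function names are assumptions, the discretization ignores crossings between grid points, and the value $M = 4$ is purely hypothetical, since the theorem only asserts that some $M > 2$ exists.

```python
import math
import random

def posterior_mass(x, t, gamma, r):
    """Posterior mass of theta = 0 at (x, t) under the normal-plus-atom prior."""
    lr = math.sqrt(r / (t + r)) * math.exp(x * x / (2.0 * (t + r)))
    return gamma / (gamma + (1.0 - gamma) * lr)

def simulate_T(lam, theta, gamma=0.5, r=1.0, dt=0.01, t_max=200.0, seed=0):
    """First grid time at which the posterior mass is <= lam, with the path value.

    Returns (t, x), or None if the horizon t_max is reached without stopping.
    """
    rng = random.Random(seed)
    x, t = 0.0, 0.0
    sd = math.sqrt(dt)
    while t < t_max:
        x += theta * dt + sd * rng.gauss(0.0, 1.0)  # Brownian increment with drift theta
        t += dt
        if posterior_mass(x, t, gamma, r) <= lam:
            return t, x
    return None

c = 0.01
M = 4.0  # hypothetical constant, for illustration only
lam_upper = 2.0 * c / (1.0 + 2.0 * c)
lam_lower = M * c / (1.0 + M * c)
# Same seed, hence the same path: the larger threshold is reached no later.
stop_lower = simulate_T(lam_lower, theta=1.0, seed=1)
stop_upper = simulate_T(lam_upper, theta=1.0, seed=1)
```

Because both calls use the same seed, they follow the same path, and the rule with the larger threshold stops no later than the one with the smaller threshold, in line with $T_{Mc/(1+Mc)} \le T_{2c/(1+2c)}$.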


PROOF. At first we prove the lower inclusion of (3.1), which is the more
difficult part. We show that for all points $(x, t) \in \mathcal{C}(Mc/(1 + Mc))$ ($M$ will be
specified during the proof) there exist stopping times $S_{(x,t)}$ of the process
$(W(s), s)$ starting at $(x, t)$, such that


(3.2) $\rho(x, t, S_{(x,t)}) < \gamma(x, t)$


holds.
Since by definition $\rho(x, t) \le \rho(x, t, S_{(x,t)})$, it follows from (3.2) and Theorem 1
that $(x, t) \in \mathcal{C}^*(c)$. We choose the stopping times as


$S_{(x,t)} = \inf\{s > t \mid \gamma(W(s), s) \le Qc\},$


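where the constant $Q > 1$ will be defined below in such a way that $Qc < 1$. [In
fact the stopping times $S_{(x,t)}$ all arise from the same stopping time $T_{Qc}$ by
changing the starting point of the process.] We need several representations of
$S_{(x,t)}$ during the proof.


(3.3)
$S_{(x,t)} = \inf\{s > t \mid F_{W(s),s}(\{0\}) \le Qc\}$
$\quad = \inf\left\{s > t \;\Big|\; \int \dfrac{dP_{\theta,s}^{(x,t)}}{dP_{0,s}^{(x,t)}}\,G_{x,t}(d\theta) \ge b(x, t)\right\}$
$\quad = \inf\left\{s > t \;\Big|\; \sqrt{\dfrac{t + r}{s + r}}\,\exp\left(\dfrac{1}{2}\left(\dfrac{W(s)^2}{s + r} - \dfrac{x^2}{t + r}\right)\right) \ge b(x, t)\right\},$

where

$b(x, t) = \dfrac{\gamma(x, t)(1 - Qc)}{(1 - \gamma(x, t))\,Qc}.$


The first equality holds by definition. The second equality follows by the
calculation


$F_{W(s),s}(\{0\}) = \dfrac{\gamma(x, t)}{\gamma(x, t) + (1 - \gamma(x, t))\,h(x, t, W(s), s)}$


with


$h(x, t, w, s) = \int \dfrac{dP_{\theta,s}^{(x,t)}}{dP_{0,s}^{(x,t)}}(w)\,G_{x,t}(d\theta).$


The third equality follows by the following calculation [note here that $G_{x,t} =
N(x/(t + r), 1/(t + r))$]:


$\int \dfrac{dP_{\theta,s}^{(x,t)}}{dP_{0,s}^{(x,t)}}\,G_{x,t}(d\theta)$
$\quad = \int \exp\left(\theta(W(s) - x) - \tfrac{1}{2}\theta^2(s - t)\right)\sqrt{\dfrac{t + r}{2\pi}}\,\exp\left(-\dfrac{t + r}{2}\left(\theta - \dfrac{x}{t + r}\right)^2\right) d\theta$
$\quad = \sqrt{\dfrac{t + r}{2\pi}}\,\exp\left(-\dfrac{x^2}{2(t + r)}\right)\int \exp\left(\theta W(s) - \tfrac{1}{2}\theta^2(s + r)\right) d\theta$
$\quad = \sqrt{\dfrac{t + r}{s + r}}\,\exp\left(\dfrac{1}{2}\left(\dfrac{W(s)^2}{s + r} - \dfrac{x^2}{t + r}\right)\right).$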
We start now to estimate the posterior risk for $S_{(x,t)}$, which is given according to
Lemma 1 by


(3.4) $\rho(x, t, S_{(x,t)}) = \gamma(x, t)\,P_0^{(x,t)}(t < S_{(x,t)} < \infty) + (1 - \gamma(x, t))\,c \int \theta^2 (S_{(x,t)} - t)\,dQ^{(x,t)}$

with the new notation $Q^{(x,t)}(dW, d\theta) = P_\theta^{(x,t)}(dW)\,G_{x,t}(d\theta)$. We will also use
$\bar Q^{(x,t)}(dW) = \int P_\theta^{(x,t)}(dW)\,G_{x,t}(d\theta)$ and will from now on simply write $S$ instead
of $S_{(x,t)}$. Then we get for the first term


(3.5) $P_0^{(x,t)}(t < S < \infty) = \dfrac{Qc}{1 - Qc}\cdot\dfrac{1 - \gamma(x, t)}{\gamma(x, t)} = b(x, t)^{-1}.$


This follows from a well known martingale argument [see Robbins and Siegmund
(1970), Lemma 1] by using (3.3):


$P_0^{(x,t)}(t < S < s) = \int_{\{t < S < s\}} \dfrac{dP_{0,S}^{(x,t)}}{d\bar Q_S^{(x,t)}}\,d\bar Q^{(x,t)} = b(x, t)^{-1}\,\bar Q^{(x,t)}\{t < S < s\}.$


Since $\bar Q^{(x,t)}\{t < S < s\} \to 1$ as $s \to \infty$, (3.5) follows. We note that for
$(x, t) \in \mathcal{C}(Qc)$, $b(x, t) > 1$ holds and that thus the probability in (3.5) is less
than one.

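To estimate the second term of (3.4) we rewrite the integral. Since on $\mathcal{F}_S^t$ we
have


$Q^{(x,t)}(dW, d\theta) = N\left(\dfrac{W(S)}{S + r}, \dfrac{1}{S + r}\right)(d\theta)\,\bar Q^{(x,t)}(dW),$


we get for the integral

(3.6)
$\int \theta^2 (S - t)\,dQ^{(x,t)} = \int \theta^2 (S + r)\,dQ^{(x,t)} - \int \theta^2 (t + r)\,dQ^{(x,t)}$
$\quad = \int\!\!\int \theta^2 (S + r)\,N\left(\dfrac{W(S)}{S + r}, \dfrac{1}{S + r}\right)(d\theta)\,d\bar Q^{(x,t)} - \int \theta^2 (t + r)\,G_{x,t}(d\theta)$
$\quad = \int \left(\dfrac{W(S)^2}{S + r} - \dfrac{x^2}{t + r}\right) d\bar Q^{(x,t)}.$


Using now the third form in (3.3) of the stopping rule $S$ yields


(3.7) $\int \left(\dfrac{W(S)^2}{S + r} - \dfrac{x^2}{t + r}\right) d\bar Q^{(x,t)} = 2\log b(x, t) + \int \log\left(\dfrac{S + r}{t + r}\right) d\bar Q^{(x,t)}.$


Let $\alpha > 3$. We show now that there exists a constant $0 < C_\alpha < \infty$ with

(3.8) $\int \log\left(\dfrac{S + r}{t + r}\right) d\bar Q^{(x,t)} \le C_\alpha \left(\int \theta^2 (S - t)\,dQ^{(x,t)}\right)^{1/\alpha}.$

Then (3.6), (3.7), and (3.8) yield

(3.9) $\int \theta^2 (S - t)\,dQ^{(x,t)} \le 2\log b(x, t) + C_\alpha \left(\int \theta^2 (S - t)\,dQ^{(x,t)}\right)^{1/\alpha},$

from which one can derive (3.2), as will be explained below.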


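The proof of (3.8) runs as follows. By using the inequality $\log(1 + x) \le K_\alpha x^{1/\alpha}$
for $x > 0$ we get by Hölder's inequality


(3.10)
$\int \log\left(\dfrac{S + r}{t + r}\right) d\bar Q^{(x,t)} = \int \log\left(1 + \dfrac{S - t}{t + r}\right) dQ^{(x,t)}$
$\quad \le K_\alpha \int \left(\dfrac{S - t}{t + r}\right)^{1/\alpha} dQ^{(x,t)}$
$\quad = K_\alpha \int (\theta^2 (S - t))^{1/\alpha}\,(\theta^2 (t + r))^{-1/\alpha}\,dQ^{(x,t)}$
$\quad \le K_\alpha \left(\int \theta^2 (S - t)\,dQ^{(x,t)}\right)^{1/\alpha} \left[\int (\theta^2 (t + r))^{-1/(\alpha - 1)}\,G_{x,t}(d\theta)\right]^{(\alpha - 1)/\alpha}.$


But since $G_{x,t} = N(x/(t + r), 1/(t + r))$ we get for $\alpha > 3$


$\int (\theta^2 (t + r))^{-1/(\alpha - 1)}\,G_{x,t}(d\theta) \le \int (\theta^2 (t + r))^{-1/(\alpha - 1)}\,N\left(0, \dfrac{1}{t + r}\right)(d\theta) = \int |y|^{-2/(\alpha - 1)}\,N(0, 1)(dy) < \infty.$


We now put


$C_\alpha = K_\alpha \left[\int |y|^{-2/(\alpha - 1)}\,N(0, 1)(dy)\right]^{(\alpha - 1)/\alpha}$


and get finally (3.8) and (3.9).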
Let $b > 1$ be given. Then by (3.9) there exists a constant $B \ge 2$ such that for
all $b(x, t) \ge b$


(3.11) $\int \theta^2 (S - t)\,dQ^{(x,t)} \le B \log b(x, t)$


holds. Now we choose $Q = B/(1 + Bc)$ and $M = bB$. Then for $(x, t) \in
\mathcal{C}(Mc/(1 + Mc))$ we have


$b(x, t) = \dfrac{\gamma(x, t)}{(1 - \gamma(x, t))\,Bc} \ge b$


and by (3.4), (3.5), and (3.11) we get further

$\rho(x, t, S) \le \gamma(x, t)\,b(x, t)^{-1} + (1 - \gamma(x, t))\,Bc \log b(x, t)$
$\quad = \gamma(x, t)\,b(x, t)^{-1}(1 + \log b(x, t))$
$\quad < \gamma(x, t).$

The last inequality follows from the inequality $x(1 + \log x^{-1}) < 1$ for $x < 1$ since
$b(x, t) \ge b > 1$. This proves (3.2).
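Now we prove the upper inclusion. We show


(3.12) $\rho(x, t) = \gamma(x, t)$ if $\gamma(x, t) \le \dfrac{2c}{1 + 2c}.$


This implies the upper inclusion of statement (3.1). The method of proof consists
in comparing the Bayes rule $T_c^*$ with the best rule if $\theta$ were known.
For the Bayes rule $T_c^*$ we always have


$\gamma(x, t) \ge \rho(x, t) = \rho(x, t, T_c^*)$
$\quad = \gamma(x, t)\,P_0^{(x,t)}(t \le T_c^* < \infty) + (1 - \gamma(x, t))\,c \int \theta^2 (T_c^* - t)\,dQ^{(x,t)}$
$\quad = \int \left[\gamma(x, t)\,P_0^{(x,t)}(t \le T_c^* < \infty) + (1 - \gamma(x, t))\,c\,\theta^2 E_\theta^{(x,t)}(T_c^* - t)\right] G_{x,t}(d\theta).$


Let the process $W$ start in $x$ and let $W(u) = z$. Under the transformation
$y = \theta(z - x)$, $s = \theta^2(u - t)$ Brownian motion with drift $\theta$ (resp. 0) goes over into
Brownian motion with drift 1 (resp. 0). With $S_\theta = \theta^2(T_c^* - t)$ we get


$\quad = \int \left[\gamma(x, t)\,P_0^{(0,0)}(0 \le S_\theta < \infty) + (1 - \gamma(x, t))\,c\,E_1^{(0,0)} S_\theta\right] G_{x,t}(d\theta)$
$\quad \ge \inf_S \left[\gamma(x, t)\,P_0^{(0,0)}(0 \le S < \infty) + (1 - \gamma(x, t))\,c\,E_1^{(0,0)} S\right] = \bar\rho(x, t).$

But $\bar\rho(x, t)$ is the minimal Bayes risk of (1.6) with $\gamma = \gamma(x, t)$. We determine now
its Bayes stopping set. By (1.4)

$\bar\rho(x, t) = \min_{0 \le p \le 1}\left(\gamma p + 2(1 - \gamma)\,c \log p^{-1}\right) = \gamma p_0 + 2(1 - \gamma)\,c \log p_0^{-1}$

with $p_0 = (2(1 - \gamma)c)/\gamma \wedge 1$ ($p_0$ denotes the stopping probability). Thus
$\bar\rho(x, t) = \gamma(x, t)$ if $2(1 - \gamma)c/\gamma \ge 1$ and $\bar\rho(x, t) < \gamma(x, t)$ if $2(1 - \gamma)c/\gamma < 1$.

The Bayes stopping region is therefore equal to

$\{(x, t) \mid \gamma(x, t) = \bar\rho(x, t)\} = \{(x, t) \mid \gamma(x, t) \le \gamma_0\},$

where $\gamma_0 = 2c(1 + 2c)^{-1}$ is determined by the equation

$p_0 = \dfrac{2(1 - \gamma_0)c}{\gamma_0} = 1.$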

We now derive a refinement of the statement of Theorem 2. For this we need a
somewhat more general notation. If $\lambda(t)$ is a positive function of time we shall
denote by $\mathcal{C}(\lambda(\cdot)) = \{(x, t) \mid \gamma(x, t) > \lambda(t)\}$.


THEOREM 3. For every $c > 0$ there exists a bounded function $\tilde c(\cdot) \ge c$ with


(a) $\tilde c(t)/c \to 1$ when $t \to \infty$ for every fixed $c$, and
(b) $\sup_{0 \le t < \infty} \tilde c(t)/c \to 1$ when $c \to 0$, such that


(3.13) $\mathcal{C}\left(\dfrac{2\tilde c(\cdot)}{1 + 2\tilde c(\cdot)}\right) \subset \mathcal{C}^*(c) \subset \mathcal{C}\left(\dfrac{2c}{1 + 2c}\right)$

holds.


The theorem states that for $c$ small or $t$ large the optimal stopping region is
very near to its upper bound $\mathcal{C}(2c/(1 + 2c))$. The proof of Theorem 3, which is
deferred to the end of this section, will show that the upper bound of $\tilde c(\cdot)/c$ is a
bit larger than $M/2$ where $M$ is the constant appearing in Theorem 2.

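Several conclusions can be drawn from the theorem. Let $\psi_c^*(t) =
\inf\{x > 0 \mid \rho(x, t) = \gamma(x, t)\}$. By Theorem 2 this definition makes sense. Thus by
the symmetry of the problem


$T_c^* = \inf\{t \mid |W(t)| \ge \psi_c^*(t)\}.$


COROLLARY 1.

$\psi_c^*(t) = \left[(t + r)\left(\log\dfrac{t + r}{r} + 2\log\dfrac{\gamma}{2(1 - \gamma)c} + o(1)\right)\right]^{1/2}$

when $t \to \infty$.


COROLLARY 2.

$\psi_c^*(t) = \left[(t + r)\left(\log\dfrac{t + r}{r} + 2\log\dfrac{\gamma}{2(1 - \gamma)c} + o(1)\right)\right]^{1/2}$

uniformly in $t$ when $c \to 0$.


COROLLARY 3. For every $\varepsilon > 0$ there exists a $c_0 > 0$ such that
$T_{2c(1+\varepsilon)/(1+2c(1+\varepsilon))} \le T_c^* \le T_{2c/(1+2c)}$ for all $c_0 \ge c > 0$.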

We can combine Corollary 3 with some recent results about boundary crossing
distributions to get the minimal Bayes risk for (1.1) up to an $o(c)$-term. A related
$O(c)$-result for the Bayes risk has been obtained by Pollak (1978), when there is
an indifference zone in the parameter space.

THEOREM 4.

(3.14) $0 \le \rho(T_{2c/(1+2c)}) - \rho(T_c^*) = o(c)$ when $c \to 0$.

The minimal Bayes risk for (1.1) is given by

(3.15) $\rho(T_c^*) = 2(1 - \gamma)\,c\left[\log b + \tfrac{1}{2}\log\log b + 1 + \tfrac{1}{2}\log 2 - A + o(1)\right]$

when $c \to 0$. Here

$b = \dfrac{\gamma}{2(1 - \gamma)c}$ and $A = 2\int_0^\infty \log x\,\varphi(x)\,dx.$


REMARK. Comparing statement (3.15) with the related formula (1.8) for the
simple testing problem shows that the additional term $2(1 - \gamma)\,c\,[\tfrac{1}{2}\log(2\log b) -
A + o(1)]$ appears in the minimal Bayes risk. This is caused by the ignorance
about the parameter $\theta \ne 0$.
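
The constant $A$ and the leading terms of (3.15) are easy to evaluate. The sketch below is illustrative and not from the paper: it uses the standard identity $A = E\log|Z| = -(\gamma_E + \log 2)/2$ for $Z \sim N(0,1)$, with $\gamma_E$ Euler's constant (an added fact, not stated in the source), checks it with a crude midpoint rule, and drops the $o(1)$ term. All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def A_closed_form() -> float:
    """A = 2 * int_0^inf log(x) phi(x) dx = E log|Z| for a standard normal Z."""
    return -0.5 * (EULER_GAMMA + math.log(2.0))

def A_numeric(upper: float = 10.0, n: int = 400000) -> float:
    """Midpoint-rule approximation of A; the log singularity at 0 is integrable."""
    h = upper / n
    s = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        s += math.log(x) * math.exp(-0.5 * x * x)
    return 2.0 * s * h / math.sqrt(2.0 * math.pi)

def risk_expansion(gamma: float, c: float) -> float:
    """Leading terms of (3.15) without the o(1) term; needs b > 1 so that log log b exists."""
    b = gamma / (2.0 * (1.0 - gamma) * c)
    return 2.0 * (1.0 - gamma) * c * (
        math.log(b) + 0.5 * math.log(math.log(b)) + 1.0 + 0.5 * math.log(2.0) - A_closed_form()
    )

def simple_risk(gamma: float, c: float) -> float:
    """Formula (1.8) for the simple testing problem with a = b."""
    b = gamma / (2.0 * (1.0 - gamma) * c)
    return 2.0 * (1.0 - gamma) * c * (math.log(b) + 1.0)

a_exact = A_closed_form()
a_approx = A_numeric()
extra = risk_expansion(0.5, 0.001) - simple_risk(0.5, 0.001)
```

The difference `extra` is the additional term $2(1-\gamma)c[\tfrac12\log(2\log b) - A]$ discussed in the Remark, here with $b = 500$.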


PROOF. From Corollary 3 it follows

(3.16) $\rho(T_{2c(1+\varepsilon)/(1+2c(1+\varepsilon))}) - \gamma\left[P_0(T_{2c(1+\varepsilon)/(1+2c(1+\varepsilon))} < \infty) - P_0(T_{2c/(1+2c)} < \infty)\right] \le \rho(T_c^*) \le \rho(T_{2c/(1+2c)}).$


We now show that the right- and left-hand sides of (3.16) differ from each other
only by an $o(c)$-term. Formula (3.5) yields


(3.17) $P_0(T_{2c(1+\varepsilon)/(1+2c(1+\varepsilon))} < \infty) - P_0(T_{2c/(1+2c)} < \infty) = \varepsilon b^{-1} = O(\varepsilon c).$


Now we compute $\rho(T_{2c/(1+2c)})$. We write from now on for simplicity $T$ instead of
$T_{2c/(1+2c)}$. By (3.5) and (3.7) for $x = 0$, $t = 0$


(3.18) $\rho(T) = 2(1 - \gamma)\,c\left[1 + \log b + \dfrac{1}{2}\int \log\left(\dfrac{T + r}{r}\right) d\bar Q\right].$


The integral on the right-hand side can be calculated by using Theorem 5 of
Jennen and Lerche (1981). The following result is intuitively plausible by virtue
of the relation


$P_\theta\{T/\log b \to 2\theta^{-2}\} = 1.$

We note


(3.19) $\int \log\left(\dfrac{T + r}{r}\right) d\bar Q = \log(2\log b) - 2A + o(1).$


Combining (3.19) with (3.18) yields

(3.20) $\rho(T_{2c/(1+2c)}) = 2(1 - \gamma)\,c\left[\log b + 1 + \dfrac{\log(2\log b)}{2} - A + o(1)\right].$


From (3.17) and (3.20) it follows also that


(3.21) $\rho(T_{2c(1+\varepsilon)/(1+2c(1+\varepsilon))}) = \rho(T_{2c/(1+2c)}) + O(\varepsilon c).$

Statement (3.21) together with (3.16) and (3.17) yields (3.14) and (3.14) together
with (3.20) yields (3.15). □


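PROOF OF THEOREM 3. The upper inclusion of (3.13) is already proved by
(3.12). Now we prove the lower one. For the stopping times


(3.22) $S_{(x,t)} = \inf\left\{s > t \;\Big|\; \gamma(W(s), s) \le \dfrac{2c}{1 + 2c}\right\}$


we show that

(3.23) $\rho(x, t, S_{(x,t)}) < \gamma(x, t)$ for $(x, t) \in \mathcal{C}\left(\dfrac{2\tilde c(\cdot)}{1 + 2\tilde c(\cdot)}\right) \cap \mathcal{C}\left(\dfrac{Mc}{1 + Mc}\right)^c,$


where $\tilde c(\cdot)$ will be specified below. $M$ is the constant of Theorem 2. Then (3.1)
together with (3.23) implies the lower inclusion of (3.13).

Now we define $\tilde c(t)$. We note that for the stopping times (3.22) by (3.9) with
$\alpha = 4$ and


$b(x, t) = \dfrac{\gamma(x, t)}{2(1 - \gamma(x, t))\,c}$


the inequality


(3.24) $\int \theta^2 (S - t)\,dQ^{(x,t)} \le 2\log b(x, t) + C(x, t)\left[\int \theta^2 (S - t)\,dQ^{(x,t)}\right]^{1/4}$

holds. The constants $C(x, t)$ are given by

(3.25)
$C(x, t) = K_4 \left[\int (\theta^2 (t + r))^{-1/3}\,G_{x,t}(d\theta)\right]^{3/4}$
$\quad = K_4 \left[\int |y|^{-2/3}\,N(\hat\theta\sqrt{t + r}, 1)(dy)\right]^{3/4}$


with $\hat\theta = x/(t + r)$.

Let $\psi_+$ and $\psi_-$ denote the positive and negative branches of the solution of
the implicit equation $\gamma(\psi_\pm(t), t) = Mc/(1 + Mc)$. By symmetry $\psi_\pm = \pm\psi_M$,
where $\psi_M$ is given by


$\psi_M(t) = \left[(t + r)\left(\log\dfrac{t + r}{r} + 2\log\dfrac{\gamma}{(1 - \gamma)Mc}\right)\right]^{1/2}.$


We choose


(3.26) $\varepsilon(t, c) = -\log\left(1 - C(\psi_M(t), t)^{1/4}\right) \wedge \log(M/2)$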

and put $\hat c(t) = c \exp(\varepsilon(t, c))$. Let $a \ge 1$. Let $d(a) = \inf\{y \ge 1 \mid a\log(ay) \le
ay - 1\}$. $d(a)$ is uniquely determined. We define $\tilde c(t) = d(\hat c(t)/c)\,\hat c(t)$.

Now we claim that $\tilde c(\cdot)/c$ has the demanded properties (a) and (b). By (3.25)
$C(x, t)$ depends only on $|\hat\theta\sqrt{t + r}| = |x|/\sqrt{t + r}$. Evaluating $|\hat\theta\sqrt{t + r}|$ at the
graphs $(\pm\psi_M(t), t)$ yields


$|\hat\theta\sqrt{t + r}| = \left[\log\dfrac{t + r}{r} + 2\log\dfrac{\gamma}{(1 - \gamma)Mc}\right]^{1/2},$


which tends to infinity, uniformly in $t$ when $c \to 0$, or when $t \to \infty$.
Consequently, $C(\pm\psi_M(t), t) \to 0$ and therefore by (3.26) $\varepsilon(t, c) \to 0$, uniformly
in $t$ when $c \to 0$, or when $t \to \infty$. Since $d(a) \to 1$ as $a \to 1$ the properties (a) and
(b) follow.
Now we show (3.23). As a first step we prove


(3.27) $c \int \theta^2 (S - t)\,dQ^{(x,t)} \le 2\hat c(t)\log b(x, t)$

for

$(x, t) \in \mathcal{C}\left(\dfrac{2\tilde c(\cdot)}{1 + 2\tilde c(\cdot)}\right) \cap \mathcal{C}\left(\dfrac{Mc}{1 + Mc}\right)^c.$

By (3.26) we can assume that $\hat c(t) < Mc/2$. Let $H(x, t) = \int \theta^2 (S - t)\,dQ^{(x,t)}$.
Then we have from (3.6), (3.7), and (3.24) (with the $x$ and $t$ variables suppressed)

(3.28) $2\log b \le H \le 2\log b + C H^{1/4}.$


Let $C_1(t) = C(\psi_M(t), t)$. Then $0 < C(x, t) \le C_1(t)$ holds on $\mathcal{C}(Mc/(1 + Mc))^c$
and therefore


(3.29) $(1 - C_1/H^{3/4})\,H \le 2\log b.$

If

(3.30) $b(x, t) \ge \exp(\tfrac{1}{2}C_1(t))$

holds for

$(x, t) \in \mathcal{C}\left(\dfrac{2\tilde c(\cdot)}{1 + 2\tilde c(\cdot)}\right) \cap \mathcal{C}\left(\dfrac{Mc}{1 + Mc}\right)^c,$

then we get from the left-hand side of (3.28) $C_1(t) \le H(x, t)$ and therefore from
(3.29)

$H(x, t) \le 2\log(b(x, t))\,(1 - C_1(t)^{1/4})^{-1}.$

But this yields (3.27).
It is left to show that (3.30) holds. Let


$c'(t) = \dfrac{c}{1 - C_1(t)^{1/4}} \Big/ \exp(\tfrac{1}{2}C_1(t)).$


An elementary calculation shows that $c'(t) \ge c$ for $0 < C_1(t) < 1$. Then


$b(x, t) = \dfrac{\gamma(x, t)}{2(1 - \gamma(x, t))\,c} \ge \dfrac{\gamma(x, t)}{2(1 - \gamma(x, t))\,c'(t)} = \dfrac{\gamma(x, t)\exp(\tfrac{1}{2}C_1(t))}{2(1 - \gamma(x, t))\,\hat c(t)} \ge \exp(\tfrac{1}{2}C_1(t)).$

The second equation holds since


$\hat c(t) = c\,(1 - C_1(t)^{1/4})^{-1} < Mc/2,$


and the last inequality follows from the definition of $\mathcal{C}(2\tilde c(\cdot)/(1 + 2\tilde c(\cdot)))$. This
proves (3.30) and completes the proof of (3.27).

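Combining now (3.4) and (3.5) with (3.27) yields for the stopping times (3.22)
the estimate for the Bayes risks


(3.31) $\rho(x, t, S) \le \gamma(x, t)\,b(x, t)^{-1} + 2(1 - \gamma(x, t))\,\hat c(t)\log b(x, t)$


with

$b(x, t) = \dfrac{\gamma(x, t)}{2(1 - \gamma(x, t))\,c}$ on $\mathcal{C}\left(\dfrac{2c}{1 + 2c}\right) \cap \mathcal{C}\left(\dfrac{Mc}{1 + Mc}\right)^c.$

We assume now that


$(x, t) \in \mathcal{C}\left(\dfrac{2\tilde c(\cdot)}{1 + 2\tilde c(\cdot)}\right) \cap \mathcal{C}\left(\dfrac{Mc}{1 + Mc}\right)^c$


and estimate the right-hand side of (3.31) further. It is equal to


(3.32)
$\gamma(x, t)\,b(x, t)^{-1}\left[1 + (\hat c(t)/c)\log b(x, t)\right]$
$\quad = \gamma(x, t)\left[h(x, t)\,\hat c(t)/c\right]^{-1}\left[1 + (\hat c(t)/c)\log(h(x, t)\,\hat c(t)/c)\right]$


with $h(x, t) = \gamma(x, t)/(2(1 - \gamma(x, t))\,\hat c(t))$. Since $(ay)^{-1}(1 + a\log(ay)) < 1$ for
$y > d(a)$ and since on $\mathcal{C}(2\tilde c(\cdot)/(1 + 2\tilde c(\cdot)))$, $h(x, t) > d(\hat c(t)/c)$ by the definition
of $\tilde c$, it follows that expression (3.32) is strictly less than $\gamma(x, t)$. This yields (3.23)
and completes the proof of Theorem 3. □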


4. Results for one-sided tests. In this section we consider the Bayes risk
given by


(4.1) $\rho(T) = \gamma P_0(T < \infty) + (1 - \gamma)\,c \int_0^\infty \theta^2 E_\theta T\,\varphi(\sqrt{r}\,\theta)\,2\sqrt{r}\,d\theta.$


For it we can characterize the minimizing stopping rule $T_c^*$ (it also exists) by
results similar to those for the two-sided case.


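If not mentioned otherwise we will use the same notation as in the preceding
sections for the corresponding objects here. For instance,


$\gamma(x, t) = F_{x,t}(\{0\}) = \dfrac{\gamma}{\gamma + (1 - \gamma)\int_0^\infty \exp(\theta x - \tfrac{1}{2}\theta^2 t)\,\varphi(\theta\sqrt{r})\,2\sqrt{r}\,d\theta}$


and also $\rho(x, t, T)$, $\rho(x, t)$, $\mathcal{C}^*(c)$, $\mathcal{C}(Kc)$, etc. The prior on $[0, \infty)$ is given by
$F = \gamma\delta_0 + (1 - \gamma)\int \varphi(\sqrt{r}\,\theta)\,2\sqrt{r}\,d\theta$.


The posterior at $(x, t)$ can be represented as
$F_{x,t} = \gamma(x, t)\delta_0 + (1 - \gamma(x, t))\,H_{x,t},$


where


$H_{x,t}(d\theta) = \dfrac{\sqrt{t + r}\;\varphi\left(\sqrt{t + r}\,\left(\theta - \dfrac{x}{t + r}\right)\right)}{\Phi\left(x/\sqrt{t + r}\right)}\,d\theta$

on $(0, \infty)$. We only state the analogous result to Theorem 2. The counterpart to
Theorem 3 holds also and can be proved in exactly the same way as Theorem 3.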


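THEOREM 5. There exists a constant $K > 2$ such that for every $c > 0$


(4.2) $\mathcal{C}\left(\dfrac{Kc}{1 + Kc}\right) \subset \mathcal{C}^*(c) \subset \mathcal{C}\left(\dfrac{2c}{1 + 2c}\right).$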





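PROOF. The proof of the upper inclusion of (4.2) runs exactly along the same
lines as that of (3.1). For the lower inclusion we show that for all $(x, t) \in
\mathcal{C}(Kc/(1 + Kc))$ there exists a stopping time $S_{(x,t)}$ of the process $(W(s), s)$
starting at $(x, t)$ such that


(4.3) $\rho(x, t, S_{(x,t)}) < \gamma(x, t)$

holds. Let $Q$ denote a constant which satisfies $Qc < 1$. We choose
$S_{(x,t)} = \inf\{s > t \mid \gamma(W(s), s) \le Qc\},$

which can be rewritten as


(4.4)
$S_{(x,t)} = \inf\left\{s > t \;\Big|\; \int \dfrac{dP_{\theta,s}^{(x,t)}}{dP_{0,s}^{(x,t)}}\,H_{x,t}(d\theta) \ge b(x, t)\right\}$
$\quad = \inf\left\{s > t \;\Big|\; \sqrt{\dfrac{t + r}{s + r}}\,\exp\left(\dfrac{1}{2}\left(\dfrac{W(s)^2}{s + r} - \dfrac{x^2}{t + r}\right)\right)\dfrac{\Phi(W(s)/\sqrt{s + r})}{\Phi(x/\sqrt{t + r})} \ge b(x, t)\right\}$

with

$b(x, t) = \dfrac{\gamma(x, t)(1 - Qc)}{(1 - \gamma(x, t))\,Qc}$ and $\Phi(y) = \int_{-\infty}^{y} \varphi(x)\,dx.$

The posterior risk at $(x, t)$ can be represented as

(4.5) $\rho(x, t, S_{(x,t)}) = \gamma(x, t)\,P_0^{(x,t)}(t < S_{(x,t)} < \infty) + (1 - \gamma(x, t))\,c \int_0^\infty \theta^2 E_\theta^{(x,t)}(S_{(x,t)} - t)\,H_{x,t}(d\theta).$


From here on we write $S$ instead of $S_{(x,t)}$. The same martingale argument as
for (3.5) yields


$P_0^{(x,t)}(t < S < \infty) = b(x, t)^{-1}.$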
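The estimate of the other part of the Bayes risk (4.5) is a bit more complicated
than that of the corresponding part of (3.4). It can be expressed after some
calculations similar to those of (3.6) as follows:


(4.6) $\int_0^\infty \theta^2 E_\theta^{(x,t)}(S - t)\,H_{x,t}(d\theta) = \int \left[\dfrac{W(S)^2}{S + r} - \dfrac{x^2}{t + r}\right] d\bar Q^{(x,t)} + \int \left[h\left(\dfrac{W(S)}{\sqrt{S + r}}\right) - h\left(\dfrac{x}{\sqrt{t + r}}\right)\right] d\bar Q^{(x,t)}$


with

$h(y) = y\varphi(y)/\Phi(y)$ and $\bar Q^{(x,t)} = \int P_\theta^{(x,t)}\,H_{x,t}(d\theta).$

Using the defining equation (4.4) of the stopping time $S$ yields

(4.7)
$\int_0^\infty \theta^2 E_\theta^{(x,t)}(S - t)\,H_{x,t}(d\theta) = 2\log b + \int \log\left(\dfrac{S + r}{t + r}\right) d\bar Q^{(x,t)}$
$\quad + \int \left[h\left(\dfrac{W(S)}{\sqrt{S + r}}\right) - h\left(\dfrac{x}{\sqrt{t + r}}\right) - 2\left(g\left(\dfrac{W(S)}{\sqrt{S + r}}\right) - g\left(\dfrac{x}{\sqrt{t + r}}\right)\right)\right] d\bar Q^{(x,t)}$


with $g(y) = \log\Phi(y)$.
Now after some calculations we get


(4.8)
$h\left(\dfrac{W(S)}{\sqrt{S + r}}\right) - h\left(\dfrac{x}{\sqrt{t + r}}\right) - 2\left(g\left(\dfrac{W(S)}{\sqrt{S + r}}\right) - g\left(\dfrac{x}{\sqrt{t + r}}\right)\right)$
$\quad = \int_{x/\sqrt{t + r}}^{W(S)/\sqrt{S + r}} \left(h'(y) - 2g'(y)\right) dy$
$\quad = \int_{x/\sqrt{t + r}}^{W(S)/\sqrt{S + r}} \dfrac{\varphi(y)}{\Phi(y)}\left[-(1 + y^2) - h(y)\right] dy.$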
But this integral is always negative. It is obvious that the integrand is negative
for positive values of $y$. That it is also negative for negative $y$-values can be seen
as follows. We have to show that


(4.9) $-(1 + y^2) - y\varphi(y)/\Phi(y) < 0$ for $y < 0$,

which is equivalent to

$-(1 + y^2)(1 - \Phi(y)) + y\varphi(y) < 0$ for $y > 0$

and to

(4.10) $1 - \Phi(y) > (y/(1 + y^2))\,\varphi(y)$ for $y > 0$.

Both sides of (4.10) vanish at $y = \infty$ and the derivative of the left-hand side is
always smaller than that of the right-hand side and both are negative, i.e.,

$-\varphi(y) < -\varphi(y)\left(1 - 2/(1 + y^2)^2\right)$ for all $y > 0$.


This yields (4.9) and therefore the integrand in (4.8) is always negative.
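
The tail bound (4.10) and the sign of the integrand in (4.8) can be spot-checked on a grid. This is a numerical illustration only, not a proof; the grid, its range and the function names are assumptions made for the example.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi(y: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def Phi(y: float) -> float:
    """Standard normal distribution function, via erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-y / SQRT2)

def upper_tail(y: float) -> float:
    """1 - Phi(y), computed without cancellation."""
    return 0.5 * math.erfc(y / SQRT2)

def mills_gap(y: float) -> float:
    """Left-hand side minus right-hand side of (4.10); should be positive for y > 0."""
    return upper_tail(y) - (y / (1.0 + y * y)) * phi(y)

def integrand_48(y: float) -> float:
    """Integrand of (4.8): (phi/Phi) * [-(1 + y^2) - y*phi/Phi]."""
    ratio = phi(y) / Phi(y)
    return ratio * (-(1.0 + y * y) - y * ratio)

positive_grid = [0.05 * k for k in range(1, 121)]   # 0.05, 0.10, ..., 6.0
full_grid = [-6.0 + 0.25 * k for k in range(49)]    # -6.0, -5.75, ..., 6.0
gaps = [mills_gap(y) for y in positive_grid]
values = [integrand_48(y) for y in full_grid]
```

On this grid every entry of `gaps` is positive and every entry of `values` is negative, consistent with (4.10) and with the claim about the integrand.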

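It is left to show that $W(S)/\sqrt{S + r} \ge x/\sqrt{t + r}$. Now let $K/(1 + Kc) > Q$.
Then $(x, t) \in \mathcal{C}(Kc/(1 + Kc))$ implies $\gamma(x, t) > Qc$, which yields $b(x, t) > 1$.
This together with (4.4) implies for $S > t$ the inequality


$\exp\left(\dfrac{x^2}{2(t + r)}\right)\Phi\left(\dfrac{x}{\sqrt{t + r}}\right) < \sqrt{\dfrac{t + r}{S + r}}\,\exp\left(\dfrac{W(S)^2}{2(S + r)}\right)\Phi\left(\dfrac{W(S)}{\sqrt{S + r}}\right).$

Since the function $\lambda \mapsto e^{\lambda^2/2}\Phi(\lambda)$ is increasing this yields $W(S)/\sqrt{S + r} >
x/\sqrt{t + r}$. Thus the expression in (4.8) is always negative, which by (4.7) yields


$\int_0^\infty \theta^2 E_\theta^{(x,t)}(S - t)\,H_{x,t}(d\theta) \le 2\log b + \int \log\left(\dfrac{S + r}{t + r}\right) d\bar Q^{(x,t)}.$


The rest of the proof is similar to that of Theorem 2 from (3.7) on. □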


Acknowledgments. The author wishes to thank W. Ehm, C. Jennen, I. 
Johnstone, C. Klaasen, S. Lalley, T. L. Lai, D. W. Müller, T. Selke, and D. 
Siegmund for stimulating discussions and the Associate Editor and the referees 
for their careful work. 


REFERENCES 


CORNFIELD, J. (1966). A Bayesian test of some classical hypotheses–with application to sequential
clinical trials. J. Amer. Statist. Assoc. 61 577-594.

DARLING, D. and ROBBINS, H. (1967). Iterated logarithm inequalities. Proc. Nat. Acad. Sci. U.S.A.
57 1188-1192.

JEFFREYS, H. (1948). Theory of Probability. 2nd ed. Clarendon, Oxford.

JENNEN, C. and LERCHE, H. R. (1981). First exit densities of Brownian motion through one-sided
moving boundaries. Z. Wahrsch. verw. Gebiete 55 138-148.

JENNEN, C. and LERCHE, H. R. (1982). Asymptotic densities of stopping times associated with tests
of power one. Z. Wahrsch. verw. Gebiete 61 501-511.

LERCHE, H. R. (1985). On the optimality of sequential tests with parabolic boundaries. In Proc. of
the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer. (L. Le Cam and R.
Olshen, eds.) 2 579-597. Wadsworth, Monterey, Calif.

LERCHE, H. R. (1986). An optimal property of the repeated significance test. Proc. Nat. Acad. Sci.
U.S.A. 83 1546-1548.

LIPTSER, R. S. and SHIRYAYEV, A. N. (1977). Statistics of Random Processes 1. Springer, Berlin.

LORDEN, G. (1977). Nearly-optimal sequential tests for finitely many parameter values. Ann. Statist.
5 1-21.

POLLAK, M. (1978). Optimality and almost optimality of mixture stopping rules. Ann. Statist. 6
910-916.

ROBBINS, H. (1970). Statistical methods related to the law of the iterated logarithm. Ann. Math.
Statist. 41 1397-1409.

ROBBINS, H. and SIEGMUND, D. (1970). Boundary crossing probabilities for the Wiener process and
partial sums. Ann. Math. Statist. 41 1410-1429.

ROBBINS, H. and SIEGMUND, D. (1973). Statistical tests of power one and their integral representation
of solutions of certain partial differential equations. Bull. Acad. Sinica 1 93-120.

SHIRYAYEV, A. N. (1978). Optimal Stopping Rules. Springer, Berlin.

WALD, A. (1947). Sequential Analysis. Wiley, New York.


UNIVERSITAT HEIDELBERG 

INSTITUT FUR ANGEWANDTE MATHEMATIK 
IM NEUENHEIMER FELD 294 

6900 HEIDELBERG 1 

WEST GERMANY 


The Annals of Statistics
1986, Vol. 14, No. 3, 1049-1067


VERY WEAK EXPANSIONS FOR SEQUENTIAL
CONFIDENCE LEVELS¹


By MICHAEL WOODROOFE 
University of Michigan 
Asymptotic expansions are derived for a class of averages of the coverage
probabilities for some sequential confidence bounds, when the data consist of
i.i.d. observations from a one-parameter exponential family. These expansions
show the effect of the optional stopping on the coverage probabilities quite
clearly and provide a method for changing the confidence limits to reduce this
effect.


1. Introduction. Let $X_1, X_2, \ldots$ denote i.i.d. random variables whose common
distribution function $F_\omega$ depends on an unknown parameter $\omega \in \Omega$. Suppose
that each $F_\omega$ has a finite mean $\theta = \theta(\omega)$ and a finite positive variance $\sigma^2 = \sigma^2(\omega)$,
and consider the problem of setting confidence bounds for $\theta$. For each integer
$n \ge 1$, let $\hat\theta_n = \bar X_n = (X_1 + \cdots + X_n)/n$; and let $\hat\sigma_n^2 = \hat\sigma_n^2(X_1, \ldots, X_n)$, $n \ge 1$,
denote a consistent sequence of positive estimators of $\sigma^2$. If $n$ is a large,
nonrandom sample size and if $X_1, \ldots, X_n$ are observed, then approximate confidence
bounds may be determined from the approximate normality of the
pivotal quantity $\sqrt{n}\,(\hat\theta_n - \theta)/\hat\sigma_n$ in a well-known manner; and the approximations
may be refined in some cases by using Edgeworth expansions for the distribution
of $\sqrt{n}\,(\hat\theta_n - \theta)/\hat\sigma_n$, as in Bhattacharya and Ghosh (1978) and Hall (1983).
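
For a fixed sample size, the first-order bound obtained from the pivotal quantity can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the sample standard deviation is used as $\hat\sigma_n$, the function name is hypothetical, and no Edgeworth correction is applied.

```python
import math
from statistics import NormalDist, mean, stdev

def upper_confidence_bound(xs, level=0.95):
    """Approximate upper confidence bound for the mean from sqrt(n)(theta_hat - theta)/sigma_hat ~ N(0, 1).

    Uses the sample standard deviation as sigma_hat (one possible consistent estimator).
    """
    n = len(xs)
    theta_hat = mean(xs)
    sigma_hat = stdev(xs)
    z = NormalDist().inv_cdf(level)  # standard normal quantile
    return theta_hat + z * sigma_hat / math.sqrt(n)

bound = upper_confidence_bound([1.0, 2.0, 3.0, 4.0, 5.0], level=0.95)
```

The same formula with $n$ replaced by a stopping time is the naive sequential bound whose coverage the expansions in this paper are meant to assess.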

Now suppose that a sequential sample is taken; that is, suppose that the fixed sample size $n$ is replaced by a stopping time $t = t(X_1, X_2, \ldots)$; and consider the problem of setting approximate confidence bounds for $\theta$ when $X_1, \ldots, X_t$ are observed. This problem arises, in particular, when estimates are required following a sequential test, as in Siegmund (1978, 1980). Under modest conditions, developed by Anscombe (1952), $\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t$ may still be approximately normal, so that the (first-order) approximation to its distribution is unaffected by the optional stopping. Of course, the exact distribution of $\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t$ is affected by the optional stopping, but the effect disappears in the approximation. Siegmund (1978) expressed dissatisfaction with such approximations in a special case and proposed a (fairly complicated) alternative. 

Here some very weak asymptotic expansions are determined for the distribution of $\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t$ and the coverage probabilities of some associated confidence bounds, in the case that $F_\omega$, $\omega \in \Omega$, is a one-parameter exponential family. When compared to the asymptotic expansions for fixed sample sizes, these expansions allow one to determine the primary effect of the optional stopping on the 


Received October 1984; revised November 1985. 

¹ Research supported by the National Science Foundation under Grant No. DMS-8413452. 

AMS 1980 subject classification. 62L12. 

Key words and phrases. Sequential confidence bounds, average coverage probabilities, posterior 
distributions, asymptotic expansions, estimation following sequential testing. 




distribution of $\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t$ in large samples. They also provide a method for correcting confidence bounds for the effect of optional stopping and other effects, such as skewness. 

The expansions derived here are very weak in the following sense: instead of the coverage probability at a fixed but arbitrary $\omega$, a collection of average coverage probabilities over $\omega$ in a neighborhood of a fixed but arbitrary $\omega_0$ is considered. 

The rationale for considering such averages is briefly as follows: average coverage probabilities are much simpler and give a better picture of the confidence level near a given $\omega_0$ than does the value at $\omega_0$; and, in repeated applications of any statistical procedure (the frequentist scenario), parameters may vary too. These points are developed in Section 5. 

After some preliminaries in Section 2, the main result is presented in Section 3 
and illustrated by examples in Section 4. Asymptotic expansions for posterior 
distributions are reviewed in Section 6 and used to prove the main result in 
Section 7. A refinement is developed in Sections 8 and 9. 

There does not appear to be a great deal known about asymptotic expansions 
for coverage probabilities in the sequential case. Anscombe (1953) and Woodroofe 
(1977) treat a special case. There is current work by Keener (1984), Takahashi 
(1985), and Woodroofe and Keener (1985). Landers and Rogge (1976) develop 
bounds on the error of normal approximation for randomly stopped sums. None 
of these authors considers average coverage probabilities, however. 

Average coverage probabilities have been considered by Stein (1981), who 
outlined a program for comparing the unconditional coverage probabilities which 
result from the use of different prior distributions. Many of the results presented 
here are in formal agreement with those in Section 2 of Stein’s paper. Stein’s 
approach is heuristic, and the application to sequential analysis is not considered. 

Woodroofe (1985) computes average risks for sequential point estimation, 
using different methods. 


2. Preliminaries. Let $\Omega$ denote a nondegenerate subinterval of $(-\infty, \infty)$ and let $F_\omega$, $\omega \in \Omega$, denote a nondegenerate, one-parameter exponential family with natural parameter space $\Omega$; that is, suppose 

(1)  $F_\omega(dx) = \exp[\omega x - \psi(\omega)]\,\Lambda(dx)$ 

for $-\infty < x < \infty$ and $\omega \in \Omega$ for some nondegenerate, sigma-finite measure $\Lambda$ on the Borel sets of $(-\infty, \infty)$ and that $\Omega$ consists of all $\omega \in (-\infty, \infty)$ for which $\int_{-\infty}^{\infty} e^{\omega x}\,\Lambda(dx) < \infty$. Then the function $\psi$ is strictly convex and real analytic on the interior $\Omega^0 = \mathrm{int}(\Omega)$; and the mean and variance of $F_\omega$ are 

(2)  $\theta = \psi'(\omega)$ and $\sigma^2 = \psi''(\omega)$ 

for $\omega \in \Omega^0$, where $'$ denotes differentiation. See, for example, Lehmann (1959, Section 2.7). 
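As a concrete check of (2), the following minimal sketch takes the Poisson family written in natural form, for which $\psi(\omega) = e^\omega$ and $\Lambda$ puts mass $1/x!$ at $x = 0, 1, \ldots$. The choice of family, the function names, and the finite-difference step are illustrative assumptions and are not part of the paper. It compares $\psi'$ and $\psi''$ with the mean and variance computed directly from (1).

```python
import math


def psi(w):
    # Cumulant function of the Poisson family in natural form: psi(w) = exp(w).
    return math.exp(w)


def derivative(f, w, h=1e-4):
    # Central finite-difference approximation to f'(w).
    return (f(w + h) - f(w - h)) / (2.0 * h)


def second_derivative(f, w, h=1e-4):
    # Central finite-difference approximation to f''(w).
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)


def poisson_moments(w, terms=200):
    # Mean and variance computed from F_w(dx) = exp(w*x - psi(w)) * Lambda(dx),
    # where Lambda puts mass 1/x! at x = 0, 1, 2, ... (series truncated at `terms`).
    probs = [math.exp(w * x - psi(w) - math.lgamma(x + 1)) for x in range(terms)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var
```

For example, `poisson_moments(0.7)` and the pair `(derivative(psi, 0.7), second_derivative(psi, 0.7))` should agree to several decimal places, both being close to $e^{0.7}$.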

Next, let $X_1, X_2, \ldots$ be i.i.d. random variables with common distribution $F_\omega$ under a probability measure $P_\omega$ for some unknown $\omega \in \Omega^0$. It is assumed that $P_\omega$, $\omega \in \Omega^0$, are defined on a common probability space $(\mathscr{X}, \mathscr{B})$ and that the mapping from $\omega$ into $P_\omega(B)$ is Borel measurable (in $\omega$) for all $B \in \mathscr{B}$. 




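Write $\Omega^0 = (\underline\omega, \bar\omega)$, where $-\infty \le \underline\omega < \bar\omega \le \infty$; let $\bar\Omega = [\underline\omega, \bar\omega]$ denote the closure of $\Omega^0$ in $[-\infty, \infty]$; and let $\Theta = \psi'(\Omega^0) = (\underline\theta, \bar\theta)$, where $-\infty \le \underline\theta < \bar\theta \le \infty$. Observe that $\psi'$ is strictly increasing on $\Omega^0$, since $\sigma^2 > 0$ there. Let $\hat\theta_n = \bar X_n$ for $n \ge 1$, as above, and define $\hat\omega_n$, $n \ge 1$, as follows: if $\hat\theta_n \le \underline\theta$, then $\hat\omega_n = \underline\omega$; if $\underline\theta < \hat\theta_n < \bar\theta$, then $\psi'(\hat\omega_n) = \hat\theta_n$; and if $\hat\theta_n \ge \bar\theta$, then $\hat\omega_n = \bar\omega$. Thus, $\hat\omega_n$ may be an extended valued random variable for each $n \ge 1$. When $\hat\theta_n \in \Theta$, however, $\hat\omega_n$ is the M.L.E. of $\omega$. Next, let $\Omega_n$, $n \ge 1$, be an increasing sequence of compact subintervals of $\Omega^0$ for which $\bigcup_{n \ge 1} \Omega_n = \Omega^0$; let $\sigma_n$, $n \ge 1$, be positive continuous functions on $\bar\Omega$ for which $\sigma_n^2 = \psi''$ on $\Omega_n$ for each $n \ge 1$; and let $\hat\sigma_n = \sigma_n(\hat\omega_n)$ for $n \ge 1$. Thus, $\hat\sigma_n$, $n \ge 1$, is an asymptotically efficient sequence of positive estimators of $\sigma$. 

Let $\mathscr{D}_n$ denote the sigma-algebra generated by $X_1, \ldots, X_n$ for each $n \ge 1$; and let $t_a$, $a \ge 1$, denote a family of stopping times w.r.t. $\mathscr{D}_n$, $n \ge 1$, which are almost surely finite w.r.t. $P_\omega$ for all $\omega \in \Omega^0$. It is assumed throughout that $t_a$, $a \ge 1$, satisfy the following two conditions: there is a continuous function $\kappa$ on $\Omega^0$ for which 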


(3) lim E, 
a—> 0 








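for a.e. $\omega \in \Omega^0$; and for every compact $\Omega_0 \subset \Omega^0$, there is an $\eta = \eta(\Omega_0) > 0$ for which 

(4)  $\lim_{a \to \infty} a^p \int_{\Omega_0} P_\omega\{t_a \le \eta a\}\, d\omega = 0$ 

for $p = 1/2$ in Theorem 1 and $p = 1$ in Theorem 2. An additional condition is imposed in Theorem 2. 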

Now consider the problem of setting approximate confidence bounds for $\theta$ when $X_1, \ldots, X_t$ are observed. (Here and below $t$ is written for $t_a$ to avoid second-order subscripts.) Let $\gamma$ denote the desired confidence coefficient, and let $c$ denote the $\gamma$th quantile of the standard normal distribution $\Phi$; that is, $0 < \gamma < 1$ and $\Phi(c) = \gamma$. Let $b_n$, $n \ge 1$, denote a sequence of continuous (real valued) functions on $\bar\Omega$ and define $c_n$ and $\mathscr{I}_n$ by 

(5)  $c_n = c + \dfrac{b_n(\hat\omega_n)}{\sqrt{n}}$  and  $\mathscr{I}_n = \left[\hat\theta_n - \dfrac{c_n \hat\sigma_n}{\sqrt{n}},\ \infty\right)$ 

for $n \ge 1$. The confidence intervals considered are of the form $\mathscr{I}_t$ for appropriate sequences $b_n$, $n \ge 1$. The confidence curves of such intervals are defined by 

$\gamma_a(\omega) = P_\omega\{\theta \in \mathscr{I}_t\} = P_\omega\{\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t \le c_t\}$ 

for $\omega \in \Omega^0$ and $a \ge 1$. As noted in Section 5, the behavior of $\gamma_a(\omega)$, $a \ge 1$, may be erratic, even in simple cases; but their averages are much better behaved. If $\xi$ is a density on $\Omega^0$, then the average coverage probability under $\xi$ is defined for 




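$a \ge 1$ by 

(6)  $\bar\gamma_a(\xi) = \int_{\Omega^0} \gamma_a(\omega)\,\xi(\omega)\, d\omega$. 

Then 

$\bar\gamma_a(\xi) = P^\xi\{\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t \le c_t\}$, 

where $P^\xi$ denotes probability in the Bayesian model in which $\omega$ has prior density $\xi$ and $X_1, X_2, \ldots$ are conditionally i.i.d. with common distribution $F_\omega$ given $\omega$. Here $P^\xi$ is defined on the space $\Omega \times \mathscr{X}$, and the random variables $X_1, X_2, \ldots$, $t_a$, $a \ge 1$, etc., are injected into the larger space. 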
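The averaging in (6) can be illustrated in the simplest nonsequential setting. The sketch below is an illustration under stated assumptions, not the paper's procedure: it uses a fixed sample size $n$ of Bernoulli($p$) observations with $b_n = 0$, estimates the standard deviation by $\sqrt{\hat p(1 - \hat p)}$, and averages the exact coverage against a weight $(p - p_0)^q (p_1 - p)^q$ in the mean parameter $p$. After the smooth change of variable to the natural parameter, that weight should correspond to a density of the form (7). All function names and the midpoint-rule grid are hypothetical choices.

```python
import math


def coverage(p, n, c):
    # Exact P_p{ sqrt(n) * (p_hat - p) / s_hat <= c } for n Bernoulli(p) trials,
    # with s_hat = sqrt(p_hat * (1 - p_hat)).
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        s_hat = math.sqrt(p_hat * (1.0 - p_hat))
        if s_hat > 0.0:
            covered = math.sqrt(n) * (p_hat - p) <= c * s_hat
        else:
            # Degenerate samples: the pivot is -infinity at p_hat = 0
            # and +infinity at p_hat = 1 (for 0 < p < 1).
            covered = (k == 0)
        if covered:
            total += math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
    return total


def average_coverage(p0, p1, n, c, q=2, grid=400):
    # Midpoint-rule average of coverage() under the weight (p - p0)^q (p1 - p)^q.
    num = 0.0
    den = 0.0
    for i in range(grid):
        p = p0 + (i + 0.5) * (p1 - p0) / grid
        w = (p - p0) ** q * (p1 - p) ** q
        num += w * coverage(p, n, c)
        den += w
    return num / den
```

Evaluating `coverage` on a fine grid of `p` shows the jumps typical of discrete data, while `average_coverage` returns a single smoothed value lying between the smallest and largest pointwise coverages on the grid.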

The focus here is on confidence bounds, as opposed to intervals, because 
expansions for intervals may be easily derived from those for bounds. In fact, 
expansions for intervals may be substantially simpler, since some of the bias 
terms may cancel. 




3. Second-order expansions. It is assumed throughout the paper that the prior density $\xi$ is of the following form: for some integer $q \ge 2$ and $\omega_0 < \omega_1$ in $\Omega^0$, 

(7)  $\xi(\omega) = (\omega - \omega_0)_+^q\,(\omega_1 - \omega)_+^q\,\xi_0(\omega)$  for $\omega \in \Omega^0$, 

where $\xi_0$ is positive and $q$ times continuously differentiable on $\Omega^0$ and $(x)_+ = \max(x, 0)$ for $-\infty < x < \infty$. 


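THEOREM 1. Suppose that $t_a$, $a \ge 1$, satisfy (3) and (4) and that $b_n = b$ for all $n \ge 1$, where $b$ is piecewise continuous on $\bar\Omega$. Then 

(8)  $\bar\gamma_a(\xi) = \gamma + \dfrac{1}{\sqrt{a}}\,\phi(c)\,\Gamma_1(\xi) + o\!\left(\dfrac{1}{\sqrt{a}}\right)$  as $a \to \infty$, 

where 

$\Gamma_1(\xi) = \int_\Omega \left[\sqrt{\kappa}\, b\, \xi + \sigma^{-1}\sqrt{\kappa}\,\xi' + \tfrac{1}{3}(c^2 - 1)\sqrt{\kappa}\,\psi_3\,\xi\right] d\omega$, 

$\psi_3 = \psi'''/\sigma^3$, and $\gamma = \Phi(c)$, for all $\xi$ of the form (7). 
Moreover, (8) is valid if $b_n$, $n \ge 1$, satisfy (20), (21), and (22) below. 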


It is easy to describe the proof of Theorem 1. Let $\mathscr{D}_t$ denote the sigma-algebra generated by $X_1, \ldots, X_t$. Then 

$\bar\gamma_a(\xi) = P^\xi\{\sqrt{t}(\theta - \hat\theta_t)/\hat\sigma_t \ge -c_t\} = \int P^\xi\{\sqrt{t}(\theta - \hat\theta_t)/\hat\sigma_t \ge -c_t \mid \mathscr{D}_t\}\, dP^\xi$ 

for all $a \ge 1$. Since posterior distributions are unaffected by optional stopping, the posterior probability may be expanded about a normal limit, using the (fixed sample size) results of Johnson (1967, 1970); and the expansion may be integrated term by term, as in Ghosh, Sinha, and Joshi (1982). The expansion (8) results 




after some algebra and simple analysis. The details are presented in Sections 6 and 7. 

Under additional modest conditions, the coefficient of $1/\sqrt{a}$ simplifies, and it is possible to make it vanish for all $\xi$ by a proper choice of $b$. 


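COROLLARY 1. If $\sqrt{\kappa}$ is absolutely continuous on (all compact subsets of) $\Omega^0$, then (8) holds with 

(9)  $\Gamma_1(\xi) = \int_\Omega T_1(\omega)\,\xi(\omega)\, d\omega$, 

where 

$T_1 = \sqrt{\kappa}\, b - \sigma^{-1}(\sqrt{\kappa})' + \tfrac{1}{6}(1 + 2c^2)\sqrt{\kappa}\,\psi_3$. 


PROOF. If $\sqrt{\kappa}$ is absolutely continuous, then 

$\int_\Omega \sigma^{-1}\sqrt{\kappa}\,\xi'\, d\omega = -\int_\Omega (\sigma^{-1}\sqrt{\kappa})'\,\xi\, d\omega = \int_\Omega \left[\tfrac{1}{2}\sqrt{\kappa}\,\psi_3 - \sigma^{-1}(\sqrt{\kappa})'\right]\xi\, d\omega$ 

for $\xi$ of the form (7); the boundary terms vanish because $\xi$ vanishes at the end points of its support, and $\sigma' = \psi'''/2\sigma$. The corollary now follows from simple algebra. □ 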
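The integration by parts behind Corollary 1 can be checked numerically. The sketch below is a hedged illustration: it assumes the Poisson family ($\sigma = e^{\omega/2}$, $\psi_3 = e^{-\omega/2}$), a hypothetical smooth $\kappa(\omega) = 1 + \omega^2$, a hypothetical $b(\omega) = 0.3\,\omega$, and an unnormalized prior of the form (7) with $\xi_0 = 1$ and $q = 2$. None of these choices comes from the paper. It compares the integral defining $\Gamma_1$ in (8) with $\int T_1 \xi\, d\omega$ from (9).

```python
import math

W0, W1, Q = 0.2, 1.0, 2   # support [w0, w1] and exponent q in (7); illustrative values
C = 1.645                 # normal quantile c; illustrative value


def xi(w):
    # Unnormalized prior of the form (7) with xi_0 = 1.
    return (w - W0) ** Q * (W1 - w) ** Q


def xi_prime(w):
    # Derivative of xi.
    return (Q * (w - W0) ** (Q - 1) * (W1 - w) ** Q
            - Q * (w - W0) ** Q * (W1 - w) ** (Q - 1))


def sigma(w):
    # Poisson family in natural form: sigma^2 = psi'' = exp(w).
    return math.exp(0.5 * w)


def psi3(w):
    # psi''' / sigma^3 for the Poisson family.
    return math.exp(-0.5 * w)


def sqrt_kappa(w):
    # Hypothetical kappa(w) = 1 + w^2 (an assumption for illustration).
    return math.sqrt(1.0 + w * w)


def sqrt_kappa_prime(w):
    return w / math.sqrt(1.0 + w * w)


def b(w):
    # Hypothetical smooth correction function.
    return 0.3 * w


def gamma1_integrand(w):
    # Integrand of Gamma_1 as written in (8).
    rk = sqrt_kappa(w)
    return (rk * b(w) * xi(w) + rk * xi_prime(w) / sigma(w)
            + (C * C - 1.0) / 3.0 * rk * psi3(w) * xi(w))


def t1_integrand(w):
    # T_1(w) * xi(w) as written in (9).
    rk = sqrt_kappa(w)
    t1 = (rk * b(w) - sqrt_kappa_prime(w) / sigma(w)
          + (1.0 + 2.0 * C * C) / 6.0 * rk * psi3(w))
    return t1 * xi(w)


def simpson(f, lo, hi, steps=2000):
    # Composite Simpson rule; `steps` must be even.
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0
```

The two quantities `simpson(gamma1_integrand, W0, W1)` and `simpson(t1_integrand, W0, W1)` should agree up to quadrature error, which is the content of the identity used in the proof.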


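COROLLARY 2. If $\sqrt{\kappa}$ is absolutely continuous and if $\sqrt{\kappa}\, b = \sigma^{-1}(\sqrt{\kappa})' - \tfrac{1}{6}(1 + 2c^2)\sqrt{\kappa}\,\psi_3$ a.e., then $\bar\gamma_a(\xi) = \gamma + o(1/\sqrt{a})$ as $a \to \infty$ for $\xi$ of the form (7). 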


Corollary 2 is obvious. 

Corollary 1 may be paraphrased by asserting that $\gamma_a = \gamma + \phi(c)T_1/\sqrt{a} + o(1/\sqrt{a})$ as $a \to \infty$, very weakly (after integration w.r.t. a large class of densities). In particular, when $b_n = 0$ for all $n \ge 1$, Corollary 1 gives a very weak expansion for the distribution function of $\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t$, 

(10)  $P_\omega\{\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t \le c\} = \Phi(c) + \dfrac{1}{\sqrt{a}}\,\phi(c)\left[\tfrac{1}{6}(1 + 2c^2)\sqrt{\kappa}\,\psi_3 - \sigma^{-1}(\sqrt{\kappa})'\right] + o\!\left(\dfrac{1}{\sqrt{a}}\right)$ 

as $a \to \infty$, very weakly. Relation (10) need not hold for any fixed $\omega \in \Omega^0$, however; see Section 5. 

Relation (10) allows a simple determination of the primary effect of optional stopping on the distribution of $\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t$ for large $a$. Indeed, if $t_a$, $a \ge 1$, are nonrandom sample sizes, say $t_a = a$ for $a \ge 1$, then (10) holds with $\kappa = 1$ and $(\sqrt{\kappa})' = 0$. The difference between (10) with $\kappa$ and (10) with $\kappa = 1$ arises from the optional stopping. 

When $t_a = a$, $a \ge 1$, and the distributions $F_\omega$, $\omega \in \Omega$, are sufficiently smooth, (10) holds for all $\omega \in \Omega^0$, with $\kappa = 1$; and letting $b = -\tfrac{1}{6}(1 + 2c^2)\psi_3$ yields $\gamma_a(\omega) = \gamma + o(1/\sqrt{a})$ as $a \to \infty$ for all $\omega \in \Omega^0$. See Hall (1983). Discrete cases, such as the binomial, Poisson, and negative binomial distributions, are not considered in Hall (1983), but are covered by Theorem 1. 




4. Examples. In this section, Theorem 1 is applied to set confidence bounds following truncated sequential probability ratio tests (S.P.R.T.) and repeated significance tests (R.S.T.). To simplify the notation it is assumed throughout that $0 \in \Omega^0$ and that $\psi$ has been so normalized that $\psi(0) = 0 = \psi'(0)$. 


EXAMPLE 1 (S.P.R.T.). Let $\delta^- < 0 < \delta^+$ denote a conjugate pair, that is, $\psi(\delta^-) = \psi(\delta^+)$; and consider the problem of testing $\omega \le \delta^-$ vs. $\omega \ge \delta^+$. Then the sequence of likelihood ratios of $\delta^+$ to $\delta^-$ is $L_n = \exp(\delta S_n)$, $n \ge 1$, where $\delta = \delta^+ - \delta^-$ and $S_n = X_1 + \cdots + X_n$, $n \ge 1$. If $\varepsilon > 0$, then 

(11)  $t_a = \inf\{n \ge 1: |S_n| > a \text{ or } n \ge a/\varepsilon\}$ 

is the stopping time of a truncated S.P.R.T. of $\omega = \delta^-$ vs. $\omega = \delta^+$ for each $a \ge 1$. It is clear that (3) holds with $\kappa(\omega) = \max\{\varepsilon, |\theta|\}$; and it is easily verified that (4) holds with $p = 1$. Thus, Theorem 1 is applicable. 
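The claim that $a/t_a$ settles near $\kappa(\omega) = \max\{\varepsilon, |\theta|\}$ can be explored by simulation. The following sketch is a rough illustration only: it assumes normal observations with mean $\theta$ and unit variance, treats the truncation in (11) as "stop once $n \ge a/\varepsilon$", and uses hypothetical function names, a fixed seed, and a small number of replications.

```python
import random


def truncated_sprt_time(theta, a, eps, rng):
    # First n >= 1 with |S_n| > a or n >= a/eps, for N(theta, 1) increments.
    s = 0.0
    n = 0
    while True:
        n += 1
        s += rng.gauss(theta, 1.0)
        if abs(s) > a or n >= a / eps:
            return n


def mean_ratio(theta, a, eps, reps=200, seed=0):
    # Monte Carlo average of a / t_a over `reps` independent runs.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += a / truncated_sprt_time(theta, a, eps, rng)
    return total / reps
```

With `theta=1.0`, `a=200.0`, `eps=0.2` the average of $a/t_a$ should be close to $1$, while with `theta=0.05`, `a=50.0`, `eps=0.2` the truncation dominates and the average should be close to $\varepsilon = 0.2$.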

There is natural interest in the (limiting) function $b$ which makes the coefficient of $1/\sqrt{a}$ vanish in (9), namely $b_0 = (\sqrt{\kappa})'/\sigma\sqrt{\kappa} - \tfrac{1}{6}(1 + 2c^2)\psi_3$. This function is undefined and discontinuous where $\theta = \pm\varepsilon$; and it may be desirable to smooth this discontinuity with an appropriate sequence $b_n$, $n \ge 1$. 
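For a concrete instance of $b_0$, consider a normal mean with unit variance, so that $\omega = \theta$, $\sigma = 1$, and $\psi_3 = 0$. With $\kappa = \max\{\varepsilon, |\theta|\}$ the formula reduces to $b_0(\theta) = \kappa'/2\kappa$, that is, $1/(2\theta)$ for $|\theta| > \varepsilon$ and $0$ for $|\theta| < \varepsilon$. The sketch below simply evaluates this special case and the adjusted quantile $c + b_0(\hat\theta_t)/\sqrt{t}$ from (5); the function names are hypothetical, and no smoothing sequence $b_n$ is attempted.

```python
import math


def b0_normal(theta, eps):
    # b_0 for a normal mean with unit variance and kappa = max(eps, |theta|):
    # (sqrt(kappa))' / sqrt(kappa) = kappa' / (2 * kappa).
    if abs(theta) == eps:
        raise ValueError("b0 is undefined where |theta| equals eps")
    if abs(theta) < eps:
        return 0.0
    return 1.0 / (2.0 * theta)


def corrected_quantile(c, theta_hat, t, eps):
    # c_t = c + b_0(theta_hat) / sqrt(t), following the form of (5).
    return c + b0_normal(theta_hat, eps) / math.sqrt(t)
```

For instance, with `eps=0.5`, an estimate `theta_hat=2.0` after `t=100` observations shifts `c=1.645` upward by `0.25 / 10 = 0.025`.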


EXAMPLE 2 (R.S.T.). Now consider the problem of testing $\omega = 0$. Let $\Lambda_n = \sup_\omega[\omega S_n - n\psi(\omega)]$, $n \ge 1$, denote the log likelihood ratio statistics; and let $0 < \delta_0 < \delta_1 < \infty$. Then 

$t_a = \inf\{n \ge a/\delta_1: \Lambda_n > a \text{ or } n \ge a/\delta_0\}$ 

defines the stopping time of an R.S.T. for each $a \ge 1$. It is easily seen that (3) holds with $\kappa(\omega) = \min\{\delta_1, \max[\delta_0, \omega\theta - \psi(\omega)]\}$ for $\omega \in \Omega^0$; and (4) holds for any $p > 0$, since $t_a \ge a/\delta_1$ for $a \ge 1$. Thus, Theorem 1 is again applicable. 

As above, there is interest in the function $b_0 = (\sqrt{\kappa})'/\sigma\sqrt{\kappa} - \tfrac{1}{6}(1 + 2c^2)\psi_3$ of Corollary 2. This function is undefined and possibly discontinuous where $\omega\theta - \psi(\omega) = \delta_0$ or $\delta_1$, but is bounded on compact $\Omega_0 \subset \Omega^0$. It may be smoothed by an appropriate sequence $b_n$, $n \ge 1$. 

Similarly, Theorem 1 may be applied to set confidence bounds following tests 
which use Anderson’s (1960) triangular region or Schwarz’s (1962) asymptotic 
shapes. 

In order to apply Theorem 1 and its corollaries to the untruncated S.P.R.T., 
the more general conditions (20), (21), and (22) (below) must be used. See 
Example 3, below. 


5. Average vs. real confidence. Asymptotic expansions for the confidence curves $\gamma_a$, $a \ge 1$, appear to be more elusive than the simple expansions of Theorems 1 and 2 (below). Woodroofe and Keener (1985) give some; but these exploit the form of the stopping times involved. The difficulty in obtaining expansions for the distributions of randomly stopped sums is noted by Siegmund (1985, Section 1.6), in particular. 

Even where obtainable, asymptotic expansions for $\gamma_a$ may be more complicated and less usable than those for $\bar\gamma_a$. For simplicity, this comparison is 




developed in the context of a symmetric S.P.R.T. about a normal mean. The 
qualitative points made apply more generally, however. 

Suppose that $X_1, X_2, \ldots$ are i.i.d. normally distributed random variables with unknown mean $\theta$, $-\infty < \theta < \infty$, and unit variance, in which case $\omega = \theta$ in (1) and (2). Then, for the untruncated S.P.R.T. [(11) with $\varepsilon = 0$], it is possible to derive an asymptotic expansion for $\gamma_a$ as $a \to \infty$. Let 

$u(\theta, r) = P_\theta\{S_k \ge r, \text{ for all } k \ge 1\}$ 

for $-\infty < r < \infty$ and $\theta > 0$. Further, let $-\infty < c < \infty$, and let $N = N(\theta, a, c)$ and $f = f(\theta, a, c)$ denote the integral and fractional parts of the quantity $a/\theta + c^2/2\theta^2 - (c/2\theta^2)\sqrt{4a\theta + c^2}$ for $a \ge 1$ and $\theta > 0$. Then by Theorem 2 of Woodroofe and Keener (1985) 


P, E(B, = 8) < c} = ®(c) = Teale) f° [1 - w(0,7)] dr 


(12) 1 1 
jose Ele tne) eda 


as $a \to \infty$ for $-\infty < c < \infty$ and $\theta > 0$. 

Observe that the fractional part in (12) oscillates wildly as $a \to \infty$. In view of this, changing $c$ slightly in order to make the coefficient of $1/\sqrt{N}$ disappear in (12) appears to be a delicate question with a possibly oscillatory answer. By contrast, the simple approximation (10) holds for all $\theta > 0$, with $\kappa(\omega) = |\theta|$; and Corollary 2 provides a method to change $c$ slightly in order to make the coefficient of $1/\sqrt{a}$ vanish. Thus, the average confidence levels $\bar\gamma_a$, $a \ge 1$, are much simpler than the confidence curves $\gamma_a$, $a \ge 1$. 
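The oscillation of the fractional part can be seen by direct evaluation. The sketch below (hypothetical function names; a plain evaluation of the quantity defined before (12), not of the expansion (12) itself) computes $N$ and $f$ for given $\theta$, $a$, and $c$. The quantity is the value of $n$ solving $\theta n + c\sqrt{n} = a$, which gives a convenient consistency check when $c > 0$.

```python
import math


def stopping_threshold(theta, a, c):
    # The quantity a/theta + c^2/(2 theta^2) - (c/(2 theta^2)) * sqrt(4 a theta + c^2).
    return (a / theta + c * c / (2.0 * theta ** 2)
            - (c / (2.0 * theta ** 2)) * math.sqrt(4.0 * a * theta + c * c))


def integral_and_fractional(theta, a, c):
    # Integral part N and fractional part f of the quantity above.
    q = stopping_threshold(theta, a, c)
    n = math.floor(q)
    return n, q - n
```

Evaluating `integral_and_fractional(0.5, a, 1.645)` for `a = 10.0, 10.1, 10.2, ...` shows `f` sweeping repeatedly through the unit interval as `a` grows, which is the behavior that the averaging in (6) smooths out.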

The oscillatory behavior of the fractional part $f$ is lost when $\gamma_a$ is replaced by $\bar\gamma_a$. This may be an advantage in some cases. For example, if a list of values of coverage probabilities on a grid of $\theta$-values is desired, then the average coverage probability near points on the grid may give a better picture of overall behavior than the value at points on the grid, precisely because the oscillations have been smoothed. 

In addition to simplicity, average confidence levels may provide a better measure of frequentist properties than do confidence curves. To see how, suppose that an experiment produces an outcome $Y$, possibly a vector, from which a confidence set $C(Y)$ for an unknown parameter $\omega$ is to be constructed; and let $\gamma(\omega) = P_\omega\{\omega \in C(Y)\}$ denote the coverage probability when $\omega$ is the state of nature. Suppose now that the experiment is repeated $N$ times with parameters $\omega_i$ and outcomes $Y_i$, $i = 1, \ldots, N$. If it is assumed that the parameters are drawn independently from an unknown distribution $G$, then the expected relative frequency of coverage is 

$\bar\gamma(G) = \int \gamma(\omega)\, G(d\omega)$; 

and this is also a first approximation to the actual frequency of coverage by the law of large numbers. Thus having good frequentist properties requires $\bar\gamma(G)$ to 




be large for all $G$ of interest. Requiring $\bar\gamma(G) \ge \gamma$ for all $G$ is equivalent to requiring $\gamma(\omega) \ge \gamma$ for all $\omega$, the conventional formulation. However, if the $G$'s of interest are all smooth, and if an approximation is allowed to replace the inequality, then the two conditions may be quite different, as illustrated above. In such cases, the average confidence levels $\bar\gamma(G)$ seem more directly related to relative frequencies than do the confidence curves. 


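6. Expansions for posterior distributions. Recall that $\xi$ denotes a density of the form (7), and $\Omega_0 = [\omega_0, \omega_1]$. 
Let $\omega_n^* = \sqrt{n}\,\hat\sigma_n(\omega - \hat\omega_n)$ for $n \ge 1$. If $g$ is a bounded measurable function on $(-\infty, \infty)$, say $|g| \le 1$, and if $\hat\omega_n \in \Omega^0$, then 

$E^\xi[g(\omega_n^*) \mid \mathscr{D}_n] = \zeta_n(g)/\zeta_n(1)$, 

where 

$\zeta_n(g) = \int g(z)\exp\left\{-n\left[\psi\!\left(\hat\omega_n + \dfrac{z}{\sqrt{n}\,\hat\sigma_n}\right) - \psi(\hat\omega_n) - \hat\theta_n\dfrac{z}{\sqrt{n}\,\hat\sigma_n}\right]\right\}\xi\!\left(\hat\omega_n + \dfrac{z}{\sqrt{n}\,\hat\sigma_n}\right) dz$ 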
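for bounded measurable $g$, and $1$ denotes the constant function. If $\hat\omega_n \in \Omega_0^0$, then it is straightforward to expand $\psi$ and $\xi$ in Taylor series about $\hat\omega_n$ and perform the formal division. To state the result, let $\xi_j = \xi^{(j)}/\sigma^j\xi$ and $\psi_j = \psi^{(j)}/\sigma^j$ for $j = 1, 2, \ldots$, where $^{(j)}$ denotes the $j$th derivative; also, let $m_j = \int_{-\infty}^{\infty} z^j\, d\Phi(z)$ denote the $j$th moment of the standard normal distribution and let $Q_j(g) = \int_{-\infty}^{\infty} (z^j - m_j)\,g(z)\, d\Phi(z)$ for bounded measurable $g$ and $j \ge 1$. If $\omega_0 < \hat\omega_n < \omega_1$, then 

(13)  $E^\xi[g(\omega_n^*) \mid \mathscr{D}_n] = \int g\, d\Phi + \dfrac{1}{\sqrt{n}}\left[Q_1(g)\,\xi_1(\hat\omega_n) - \tfrac{1}{6}Q_3(g)\,\psi_3(\hat\omega_n)\right] + \dfrac{1}{n}R_n^0$, 

where $R_n^0$ is a remainder term; and, if $q \ge 3$ in (7), then 

(14)  $R_n^0 = \tfrac{1}{2}Q_2(g)\,\xi_2(\hat\omega_n) - \tfrac{1}{6}Q_4(g)\,\psi_3(\hat\omega_n)\,\xi_1(\hat\omega_n) - \tfrac{1}{24}Q_4(g)\,\psi_4(\hat\omega_n) + \tfrac{1}{72}Q_6(g)\,\psi_3^2(\hat\omega_n) + \dfrac{1}{\sqrt{n}}R_n^1$, 

where $R_n^1$ is another remainder term. See Johnson (1967, 1970). 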

There is special interest here in the case that $g$ is the indicator of an interval $[-c, \infty)$, where $-\infty < c < \infty$. In this case $Q_j(c)$ is written for $Q_j(g)$. Letting $\phi = \Phi'$ denote the standard normal density, it is easily verified that $Q_1(c) = \phi(c)$, $Q_2(c) = -c\,\phi(c)$, $Q_3(c) = (2 + c^2)\,\phi(c)$, $Q_4(c) = -(3c + c^3)\,\phi(c)$, and $Q_6(c) = -(15c + 5c^3 + c^5)\,\phi(c)$ for $-\infty < c < \infty$. 
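These closed forms are easy to confirm numerically. The sketch below is an illustrative check, not part of the paper: it evaluates $Q_j(c) = \int_{-c}^{\infty} (z^j - m_j)\,\phi(z)\,dz$ by a composite Simpson rule on a truncated range and compares the result with the stated expressions. The truncation point, step count, and function names are assumptions made for the example.

```python
import math

# Standard normal moments m_j for the indices used in the text.
MOMENTS = {1: 0.0, 2: 1.0, 3: 0.0, 4: 3.0, 6: 15.0}


def phi(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def q_numeric(j, c, upper=12.0, steps=20000):
    # Composite Simpson approximation to the integral of (z^j - m_j) * phi(z)
    # over [-c, upper]; the tail beyond `upper` is negligible.
    lo = -c
    h = (upper - lo) / steps

    def f(z):
        return (z ** j - MOMENTS[j]) * phi(z)

    total = f(lo) + f(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0


def q_closed(j, c):
    # Closed forms stated in the text for g the indicator of [-c, infinity).
    p = phi(c)
    forms = {
        1: p,
        2: -c * p,
        3: (2.0 + c * c) * p,
        4: -(3.0 * c + c ** 3) * p,
        6: -(15.0 * c + 5.0 * c ** 3 + c ** 5) * p,
    }
    return forms[j]
```

Comparing `q_numeric(j, c)` with `q_closed(j, c)` for `j` in `(1, 2, 3, 4, 6)` and a few values of `c` gives agreement to within the quadrature error.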

The following bounds for $R_n^0$ and $R_n^1$, $n \ge 1$, are adapted from Ghosh, Sinha, and Joshi (1982). Let $A_n$ denote the event 

(15)  $A_n = \left\{\omega_0 + \dfrac{\log n}{\sqrt{n}} \le \hat\omega_n \le \omega_1 - \dfrac{\log n}{\sqrt{n}}\right\}$ 

for $n \ge 1$. Then a careful examination of the Taylor series expansions and formal division shows the existence of constants $K_1 = K_1(\xi)$, depending on $\xi$ but not on $n$, $g$, or $\hat\omega_n$ (if $|g| \le 1$), for which 


(16) |Rala,| < K, (aq = @) 2° + (oy = ân) 7] 
for $n \ge 2$ and $i = 0, 1$, where $I_A$ denotes the indicator of an event $A$. 


LEMMA 1. Let $\xi$ and $A_n$, $n \ge 1$, be as in (7) and (15) with $q \ge 2$. Then: (i) 

$P^\xi\left(\bigcup_{k=n}^{\infty} A_k'\right) \le K\left(\dfrac{\log n}{\sqrt{n}}\right)^{q+1}$ 

for all $n \ge 2$ for some constant $K$, which is independent of $n$; and (ii) 


EX sup [(6, ~ 9) 57+ (o ~ ên) 3 lL} < 00. 
n22 


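PROOF. Let $\Omega_0 = [\omega_0, \omega_1]$ denote the support of $\xi$. Then an easy exercise using Bernstein's inequality and simple properties of $\psi'$ shows the existence of $\varepsilon_0 = \varepsilon_0(\Omega_0) > 0$, $\delta_0 = \delta_0(\Omega_0) > 0$, and $0 < K_0 = K_0(\Omega_0) < \infty$ for which 

(17)  $P_\omega\{|\hat\omega_n - \omega| \ge \varepsilon\} \le K_0\exp[-n\delta_0\varepsilon^2]$ 

for all $0 < \varepsilon < \varepsilon_0$, $\omega \in \Omega_0$, and $n \ge 1$. Let $\varepsilon_n = \log n/\sqrt{n}$ for $n \ge 1$; and observe that $\varepsilon_n$ is decreasing in $n \ge 9$. Let $m \ge 9$ be so large that $\varepsilon_n < \varepsilon_0$ for all $n \ge m$. If $n \ge m$ and if $\hat\omega_k < \omega_0 + \varepsilon_k$ for some $k \ge n$, then either $\omega < \omega_0 + 2\varepsilon_n$ or $|\hat\omega_k - \omega| > \varepsilon_k$ for some $k \ge n$. So, for $n \ge m$, 

(18)  $P^\xi\{\hat\omega_k < \omega_0 + \varepsilon_k,\ \exists\, k \ge n\} \le \int_{\omega_0}^{\omega_0 + 2\varepsilon_n} \xi(\omega)\, d\omega + \int P_\omega\{|\hat\omega_k - \omega| > \varepsilon_k,\ \exists\, k \ge n\}\,\xi(\omega)\, d\omega$, 

which is of the order $(\log n/\sqrt{n})^{q+1}$ as $n \to \infty$, by (7) and (17). Inequality (18) and its dual (at $\omega_1$) combine to prove (i). 

The proof of (ii) is similar. Let $m$ be as above; let $x > 1/\varepsilon_m$; and let $l = l(x)$ be the largest integer $n \ge m$ for which $\varepsilon_n \ge 1/x$. If $n \ge m$, $A_n$ occurs, and $(\hat\omega_n - \omega_0)_+^{-1} > x$, then $\varepsilon_n < 1/x$ and, therefore, $n > l$. Thus, if $\sup_{n \ge m}(\hat\omega_n - \omega_0)_+^{-1} I_{A_n} > x$, then either $\omega \le \omega_0 + 2\varepsilon_l$ or $|\hat\omega_n - \omega| > \varepsilon_n$ for some $n > l$. The probability of the latter event may be estimated as in (18) and is easily seen to be of order $1/x^{q+1}$ as $x \to \infty$. Assertion (ii) then follows from this result and its dual (at $\omega_1$), since $(\hat\omega_n - \omega_0)_+^{-1}$ and $(\omega_1 - \hat\omega_n)_+^{-1}$ are bounded on $A_n$ for each fixed $n \ge 2$. □ 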


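7. Proof of Theorem 1. The conditions on $b_n$, $n \ge 1$, in Theorem 1 are described next. Let $t_a$, $a \ge 1$, denote stopping times which satisfy (3) and (4) with $p = 1/2$; let $\Omega_0 = [\omega_0, \omega_1] \subset \Omega^0$ be compact; and define $B_a = B_a(\Omega_0)$ by 

(19)  $B_a = \left\{t > \eta a,\ \omega_0 + \dfrac{\log t}{\sqrt{t}} \le \hat\omega_t \le \omega_1 - \dfrac{\log t}{\sqrt{t}}\right\}$ 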




for $a \ge 1$, where $\eta = \eta(\Omega_0)$ is as in (4). Let $P^0$ denote $P^\xi$ when $\xi$ is the uniform density on $\Omega_0$, and observe that $P^\xi \le KP^0$ for some $K = K(\xi)$ for all $\xi$ of the form (7) with support $\Omega_0$. In Theorem 1, it is assumed that 

(20)  $\left(\dfrac{a}{t}\right)^p |b_t(\hat\omega_t)| \cdot I_{B_a}$, $a \ge 1$, are unif. int. w.r.t. $P^0$ 

and 

(21)  $\operatorname{ess\,sup} |b_t(\hat\omega_t)/\sqrt{t}| \cdot I_{B_a} \to 0$ (mod $P^0$) 

as $a \to \infty$ for all compact $\Omega_0 \subset \Omega^0$ and $p = 1/2$. In addition, it is assumed that there is a measurable function $b$ on $\Omega^0$ for which 

(22)  $b_t(\hat\omega_t) \to b(\omega)$ in $P_\omega$-probability 

as $a \to \infty$ for a.e. $\omega \in \Omega^0$. These conditions have been formulated to obtain reasonable generality and to simplify the proofs of Theorems 1 and 2. They are not especially elegant. The conditions are satisfied if $b_n$, $n \ge 1$, converge to a continuous limit uniformly on compact subintervals of $\Omega^0$; but they are substantially more general. Example 3 illustrates the interplay between $t_a$, $a \ge 1$, and $b_n$, $n \ge 1$, in (19) and (20). 

The proof of Theorem 1 is given next. The reader may wish to review its 
statement in Section 3. 


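PROOF OF THEOREM 1. Fix a density $\xi$ of the form (7); let $\Omega_0 = [\omega_0, \omega_1]$ denote its support; and define $B_a$, $a \ge 1$, by (19). Then $P^\xi(B_a') = o(1/\sqrt{a})$ as $a \to \infty$ by (4) and Lemma 1. So, $\bar\gamma_a(\xi) = P^\xi\{\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t \le c_t,\ B_a\} + o(1/\sqrt{a})$ as $a \to \infty$. 

It is convenient to first consider $\omega$, where $\psi'(\omega) = \theta$. Recall that $\omega_n^* = \sqrt{n}\,\hat\sigma_n(\omega - \hat\omega_n)$ for $n \ge 1$, and define $A_a$ by 

(23)  $A_a = P^\xi\{\omega_t^* \ge -c_t,\ B_a\}$ 

for $a \ge 1$. Let $\mathscr{D}_t$ denote the sigma-algebra of events which are determined prior to time $t$. Then, since posterior distributions are unaffected by optional stopping, 

(24)  $A_a = \int_{B_a} P^\xi\{\omega_t^* \ge -c_t \mid \mathscr{D}_t\}\, dP^\xi = \int_{B_a} \Phi(c_t)\, dP^\xi + \dfrac{1}{\sqrt{a}}\int_{B_a} \sqrt{\dfrac{a}{t}}\left[Q_1(c_t)\,\xi_1(\hat\omega_t) - \tfrac{1}{6}Q_3(c_t)\,\psi_3(\hat\omega_t)\right] dP^\xi + \dfrac{1}{a}\int_{B_a} \dfrac{a}{t}\,R_t^0\, dP^\xi$ 

for $a \ge 1$. See (13). The latter three terms are considered separately. 
By (16) and Lemma 1, the last integral in (24) remains bounded as $a \to \infty$; so, the final term is of order $1/a$ as $a \to \infty$. Next, $c_t \to c$ and 

$\sqrt{a/t}\; Q_1(c_t)\,\xi_1(\hat\omega_t) \to \sqrt{\kappa(\omega)}\; Q_1(c)\,\xi_1(\omega) = \phi(c)\sqrt{\kappa(\omega)}\;\xi'(\omega)/\sigma(\omega)\xi(\omega)$ 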




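in $P^\xi$-probability as $a \to \infty$; and 

$\sqrt{a/t}\;|Q_1(c_t)\,\xi_1(\hat\omega_t)| \le K\left[(\hat\omega_t - \omega_0)_+^{-1} + (\omega_1 - \hat\omega_t)_+^{-1}\right]$ 

on $B_a$ for all $a \ge 1$ for some constant $K$. So, 

$\int_{B_a} \sqrt{\dfrac{a}{t}}\; Q_1(c_t)\,\xi_1(\hat\omega_t)\, dP^\xi \to \phi(c)\int_\Omega \sigma^{-1}\sqrt{\kappa}\,\xi'\, d\omega$ 

as $a \to \infty$ by the Dominated Convergence Theorem and Lemma 1. Similarly, 

$\int_{B_a} \sqrt{\dfrac{a}{t}}\; Q_3(c_t)\,\psi_3(\hat\omega_t)\, dP^\xi \to (2 + c^2)\,\phi(c)\int_\Omega \sqrt{\kappa}\,\psi_3\,\xi\, d\omega$ 

as $a \to \infty$. To estimate the first term on the right side of (24), expand $\Phi(c_t)$ in a Taylor series about $c$ as $\Phi(c_t) = \Phi(c) + \phi(c_t^*)(c_t - c) = \gamma + \phi(c_t^*)\,b_t(\hat\omega_t)/\sqrt{t}$, where $|c_t^* - c| \le |c_t - c|$. Thus, 

$\int_{B_a} \Phi(c_t)\, dP^\xi = \gamma + \dfrac{1}{\sqrt{a}}\int_{B_a} \sqrt{\dfrac{a}{t}}\;\phi(c_t^*)\,b_t(\hat\omega_t)\, dP^\xi + o\!\left(\dfrac{1}{\sqrt{a}}\right) = \gamma + \dfrac{1}{\sqrt{a}}\,\phi(c)\int_\Omega \sqrt{\kappa}\, b\,\xi\, d\omega + o\!\left(\dfrac{1}{\sqrt{a}}\right)$ 

as $a \to \infty$, since $P^\xi(B_a') = o(1/\sqrt{a})$ and since (20) and (22) imply the convergence of the integrals. Substituting these three limits into (24) yields the following asymptotic expansion for $A_a$, 

(25)  $A_a = \gamma + \dfrac{1}{\sqrt{a}}\,\phi(c)\int_\Omega \left[\sqrt{\kappa}\, b\,\xi + \sigma^{-1}\sqrt{\kappa}\,\xi' - \tfrac{1}{6}(2 + c^2)\sqrt{\kappa}\,\psi_3\,\xi\right] d\omega + o\!\left(\dfrac{1}{\sqrt{a}}\right)$ 

as $a \to \infty$. 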

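It remains to relate $A_a$ to $\bar\gamma_a(\xi)$. Let $\nu$ denote the inverse function to $\psi'$, so that $\omega = \nu(\theta)$. If $\hat\omega_n \in \Omega^0$ and $n$ is sufficiently large, then $\sqrt{n}(\hat\theta_n - \theta)/\hat\sigma_n \le c_n$ iff $\theta \ge \hat\theta_n - c_n\hat\sigma_n/\sqrt{n}$ iff $\omega \ge \nu(\hat\theta_n - c_n\hat\sigma_n/\sqrt{n})$; and 

$\nu\!\left(\hat\theta_n - \dfrac{c_n\hat\sigma_n}{\sqrt{n}}\right) = \nu(\hat\theta_n) - \nu'(\hat\theta_n)\dfrac{c_n\hat\sigma_n}{\sqrt{n}} + \dfrac{1}{2}\nu''(\theta_n^*)\dfrac{c_n^2\hat\sigma_n^2}{n} = \hat\omega_n - \dfrac{c_n}{\hat\sigma_n\sqrt{n}} + \dfrac{1}{2}\nu''(\theta_n^*)\dfrac{c_n^2\hat\sigma_n^2}{n}$ 

for some intermediate point $\theta_n^*$ between $\hat\theta_n$ and $\hat\theta_n - c_n\hat\sigma_n/\sqrt{n}$. So, if $B_a$ occurs and $a$ is sufficiently large, then $\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t \le c_t$ iff $\omega_t^* \ge -c_t^0$, where $c_n^0$, $n \ge 1$, are defined by (5) with $b_n$, $n \ge 1$, replaced by $b_n^0$, $n \ge 1$, where 

$b_n^0(\hat\omega_n) = b_n(\hat\omega_n) - \tfrac{1}{2}\hat\sigma_n^3\,\nu''(\theta_n^*)\,c_n^2$ 

for $\hat\omega_n \in \Omega_0$ and $n \ge 1$. The functions $b_n^0$, $n \ge 1$, may be extended to $\bar\Omega$ in a piecewise constant fashion. It follows that $\bar\gamma_a(\xi) = A_a^0 + o(1/\sqrt{a})$ as $a \to \infty$, where $A_a^0$ is defined by (23) with $b_n$, $n \ge 1$, replaced by $b_n^0$, $n \ge 1$. If $b_n$, $n \ge 1$, satisfy conditions (20)-(22) with limit $b$, then it is easily verified that $b_n^0$, $n \ge 1$, satisfy (20)-(22) with limit $b^0 = b + \tfrac{1}{2}c^2\psi_3$, since $\nu'' = -\psi'''/\sigma^6$. Relation (8) now follows from replacing $b_n$, $n \ge 1$, by $b_n^0$, $n \ge 1$, in (25). □ 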




The need for the more general conditions on b,, n 2 1, is illustrated by the 
following example. 


EXAMPLE 3 (S.P.R.T.). The stopping time of an untruncated S.P.R.T. is of the form (11) with $\varepsilon = 0$. Then $\kappa = |\theta|$ and the function $b_0$ which makes the coefficient of $1/\sqrt{a}$ disappear has a nonintegrable discontinuity at $\omega = 0$. It is possible to construct an increasing sequence $\Omega_n$, $n \ge 1$, of compact subsets of $\Omega^0$ for which $\bigcup_{n \ge 1} \Omega_n = \Omega^0$ and continuous functions $b_n$, $n \ge 1$, for which $|b_n| \le \min\{\sqrt{n}, |b_0|\}$ on $\bar\Omega$ and $b_n = b_0$ on $\Omega_n$ for all $n \ge 1$. Any such sequence satisfies (20)-(22). To see this let $\Omega_0$ denote any compact subinterval of $\Omega^0$ and observe that $|b_n(\omega)| \le K_1/|\theta|$ for all $\omega \in \Omega_0$ and $n \ge 1$ for some constant $K_1 = K_1(\Omega_0)$. Thus, $|b_t(\hat\omega_t)| \le K_1/|\hat\theta_t| \le K_1 t/a$ and, therefore, $\sqrt{a/t}\,|b_t(\hat\omega_t)| \le K_1\sqrt{t/a}$ on $B_a$ for all $a \ge 1$. Next, using Wald's Lemma and the fact that $E_\omega[\sup_{n \ge 1} X_n^2/n]$ is bounded w.r.t. $\omega \in \Omega_0$, it is easily seen that there is a constant $K_2$ for which $E_\omega[t_a/a] \le K_2/|\theta|$ for all $\omega \in \Omega_0$. Thus, 


Í, Eee dP% < K3ZE®% BE < Ka f 0 By eee 


for all a = 1 for some constant K,. Conditions (20)-(22) follow. 


8. A lemma. This section contains a technical lemma which is needed to 
obtain higher order expansions. As above, elegance has been sacrificed for 
generality in the statement of some conditions. Let 


(26) Ah(ĝ, 6) = sup{|h(s) — h(4)|: js — ôi < 10 — ôl) 
for functions h defined on 8. 
LEMMA 2. Lett, a = 1, denote stopping times which satisfy (3) and (4) with 


p= 1. Let Qo = [w @;] be a compact subinterval of 2°; let 8, = Y(R); and 
let g be a continuously differentiable function on O§ for which 


(27) lg'(8)| < K [lo — &)37+(w, = w);"| 


for all 0 = '(w) E ©§ for some constant K. Define B,, a > 1, by (19) and let 
U,, n 2 1, be uniformly bounded and @,, n = 1, measurable. Then 


z 2 
(28) J, Val (F) la) - e(4)] aE = oliva) 
as a > œ for all of the form (7) with support Q, and q = 3. 

Moreover, (28) continues to hold if g is replaced by a sequence g,, n = 1, of 
continuously differentiable functions for which 


(29) Í, WE ZO apa) 


2 


/3 
= o(va) 




and 


(30) f, VE] seca oy] ape = oc 


as a > œ for some a > 1, where P® is as in (21). 


Proor. If g is continuously differentiable on ©, then the integrand 
in (28) may be written U,/(a/t)[g(9) — e(6,)] = Ia + H,, where I, = 
Ujf(a/t) 8'(8,X9 — Â) and IT, = U,l(a/t) Lg) — 2'(8,)K8 — 8.) for some 
intermediate point 6* on B, for all a > 1. A simple integration by parts shows 
that E*(6 — 6\9,) = (1/t)E*(o§,|9,) on B, for a > 1. So, 


J, ul GEORECIA ar 
° wo Geo ar) | [jea 


for all a 2 1 for some constant K,; and this is of smaller order of magnitude than 
1/ Va under both sets of assumptions on g, by Lemma 1 under the first set. Let 
a > 1 and £ > 1 satisfy 1/a + 1/8 = 1. Then 

x 1/a 

|e 


vaf [| 4PF < Kd LG 


1/8 
x{ f sup [Valĝ, — s1)” apt) 


nznqa 


< 





I, dP* 
B, 








a’(0;*) — 2'( Â.) 








for all a 2 1 for some constant K,. The second factor remains bounded as 
a > œ for any £ > 1. If (27) holds, then the first factor approaches zero as 
a —> œ for any a < ł by Lemma 1 and the Dominated Convergence Theorem; 
and if (30) holds, then the first factor approaches zero for some a > 1. O 


9. Higher-order expansions. There are three important differences between Theorem 2, which computes $\bar\gamma_a(\xi)$ up to $o(1/a)$, and Theorem 1: the algebra is substantially more cumbersome; the analysis is slightly more complicated, since the coefficients of $1$ and $1/\sqrt{a}$ must be computed more accurately; and some additional conditions are needed to justify the more delicate calculations. 

In Theorem 2 it is assumed that the stopping times $t_a$, $a \ge 1$, satisfy (3) and (4) with $p = 1$. In addition, it is assumed that 


(31) A ell /(3) z Velo) Jieesa} 


as a — œ with 7 = 7(2,) as in (4) for all compact 2, Cc Q°. Further, it is 


2 


dw 
Ua = o(1/a) 










assumed that the functions $b_n$, $n \ge 1$, of (5) are of the form 

(32)  $b_n = b_{1n} + \dfrac{1}{\sqrt{n}}\,b_{2n}$, 

where $b_{1n}$, $n \ge 1$, and $b_{2n}$, $n \ge 1$, both satisfy conditions (21) and (22), with limits denoted by $b_1$ and $b_2$, and $b_{1n}^2$, $n \ge 1$, and $b_{2n}$, $n \ge 1$, both satisfy condition (20) with $p = 1$. Finally, letting $g_n(\theta) = b_{1n}(\omega)$ for $\theta = \psi'(\omega) \in \Theta$, it is assumed that $g_n$, $n \ge 1$, satisfy (29) and (30). 


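THEOREM 2. Suppose that $t_a$, $a \ge 1$, and $b_n$, $n \ge 1$, satisfy the conditions imposed in the previous paragraph. Then 

$\bar\gamma_a(\xi) = \gamma + \dfrac{1}{\sqrt{a}}\,\phi(c)\,\Gamma_1(\xi) + \dfrac{1}{a}\,\phi(c)\,\Gamma_2(\xi) + o\!\left(\dfrac{1}{a}\right)$ 

as $a \to \infty$ for all $\xi$ of the form (7) with $q \ge 3$. Here $\Gamma_1(\xi)$ is as in (8) and 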
E,(¢) = I [ile - c)p,— 44 — 31e? + 15c)y3] kë do 


(33) + AEE — 26?)o yag — teo?” — co 1b, g]k do 
+ f [3(8e — c*) by, + by — heb? | kë do. 


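PROOF. Fix a density $\xi$ of the form (7) with $q \ge 3$ and support $\Omega_0 = [\omega_0, \omega_1] \subset \Omega^0$; and define $B_a$, $a \ge 1$, by (19). Then $P^\xi(B_a') = o(1/a)$ as $a \to \infty$ by (4) and Lemma 1. So, $\bar\gamma_a(\xi) = P^\xi\{\sqrt{t}(\hat\theta_t - \theta)/\hat\sigma_t \le c_t,\ B_a\} + o(1/a)$ as $a \to \infty$, as in the proof of Theorem 1. Define $A_a$, $a \ge 1$, by (23). Then, also as in the proof of Theorem 1, 

(34)  $A_a = \beta_0(a) + \dfrac{1}{\sqrt{a}}\,\beta_1(a) + \dfrac{1}{a}\,\beta_2(a) + \left(\dfrac{1}{a}\right)^{3/2}\beta_3(a)$, 

where 

$\beta_0(a) = \int_{B_a} \Phi(c_t)\, dP^\xi$, 

$\beta_1(a) = \int_{B_a} \sqrt{\dfrac{a}{t}}\left[Q_1(c_t)\,\xi_1(\hat\omega_t) - \tfrac{1}{6}Q_3(c_t)\,\psi_3(\hat\omega_t)\right] dP^\xi$, 

$\beta_2(a) = \int_{B_a} \dfrac{a}{t}\left[\tfrac{1}{2}Q_2(c_t)\,\xi_2(\hat\omega_t) - \tfrac{1}{6}Q_4(c_t)\,\psi_3(\hat\omega_t)\,\xi_1(\hat\omega_t) - \tfrac{1}{24}Q_4(c_t)\,\psi_4(\hat\omega_t) + \tfrac{1}{72}Q_6(c_t)\,\psi_3^2(\hat\omega_t)\right] dP^\xi$, 

and 

$\beta_3(a) = \int_{B_a} \left(\dfrac{a}{t}\right)^{3/2} R_t^1\, dP^\xi$ 

for $a \ge 1$. See (13), (14), and (16). 
By (16) and Lemma 1, $\beta_3(a) = O(1)$ as $a \to \infty$. So, the final term in (34) is $o(1/a)$ as $a \to \infty$. Next, the Law of Large Numbers, the Dominated Convergence 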


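Theorem, and Lemma 1 combine to show that 

$\beta_2(a) \to \beta_2^0 = \tfrac{1}{2}Q_2(c)\int \kappa\,\sigma^{-2}\xi''\, d\omega - \tfrac{1}{6}Q_4(c)\int \kappa\,\psi_3\,\sigma^{-1}\xi'\, d\omega - \tfrac{1}{24}Q_4(c)\int \kappa\,\psi_4\,\xi\, d\omega + \tfrac{1}{72}Q_6(c)\int \kappa\,\psi_3^2\,\xi\, d\omega$ 

as $a \to \infty$, as in the proof of Theorem 1. For $\beta_1$, expand $Q_1(c_t)$ and $Q_3(c_t)$ in Taylor series about $c$ to find that 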


Bila) = Qo) f f() 6a) art- sese) f, (F) tta) ar 


+ xf (Fea rlenene(a,) n LACAL CA) dp‘ 


1 
= B,,(a) — B,(a) + Tq Pal), say, 
where c; and c; are intermediate points. As above, it is easily seen that 


Bila) > Bi = Qile) fabs Edo — 4QK(c) f rbiya de. 


B3% = Q(c) [veeEdw and BY = $Q,(c) f Ve vsé do. 
Q 2 
Then 


Bula) = 8 = Qe) f yf($) eX) = tico) apt 


(35) 
° seo E - Veo) lesto) dP! — Qile) f Vt dP! 


for all a > 1. Here the first term on the right is o(1/ Va) as a > œ, by Lemma 
2; the second is o(1/ Va) as a > œ by (31); and the third is o(1/ Va) as a > œ% 
by Schwarz’s Inequality, since P*(B/) = o(1/a) as a > œ and Vr¢, is square 
integrable w.r.t. Pt. This shows that |8,,(a@) — 8?,| = o(1/ Va) as a > 00; anda 
similar argument shows that |8,,(a) — 8%] = o(1/ Ya), as a > œ too. Finally, 
consider B,(a), a > 1. Expanding ®(c,) in a Taylor series about c and using 
P*(B‘) = o(1/a) yields 


pola) =l) + rlo S (F) 6.40, art 
+f fo ota ~ bet pt] ao 


Rae l 1 1 2 1 
=1+ rea) + — AH) + oZ), say, 




as a > œ for some intermediate point cf between c, and c. As in the proof of 
Theorem 1, conditions (20)-(22) yield 


Bia) > BE = 9(c) [x by — xed] do 


as a —> oo. Let 8! = $(c)fove bE dw. Then |B(a) — B!| = o(1/ Va) as a > œ 
by an argument similar to that of (35), using the second set of conditions in 
Lemma 2 and the integrability of «b7é. 

Let 8, = B? + Bi and $, = $? + Bz + Bz, where £? = Bh, — Bi. Then 


1 1 1 
(36) A, = ®(c) + =A, + TB: + of =] 


as a — oo. To complete the proof, let v denote the inverse function to y’. If a is 
aay large and if B, occurs, then it is easily seen that ve (6 1, - 9)/6, < c, iff 

= ¥té(w — &,) = —c?, where c°, n > 1, are defined by (5) and (32), but with 
i n > 1, and bp n 2 1, replaced by be, n> 1, and b3,, n > 1, where 


bo,(&,) oy binl Ân) t tepalân) 


Bin( On) = Don( Gq) + Ya) | CBy(O,) + se t, + go” (07) crb 
for some intermediate point 6# between 6, and 6, — c,6,/ Yn for &, € Q, and 
sufficiently large n. If b,,, n > 1, and b.,, n > 1, satisfy the conditions of the 
theorem with limits b, and b,, then it is easily seen that b°,, n > 1, and 5b8,, 
n > 1, satisfy these conditions too, but with limits b? = b, + c*,/2 and b? = 
by + eb, — ch ,/6 + 43/2. Let A9, a > 1, denote A,, a > 1, whenc,, n > 1, 
are replaced by c?, n > 1. Then Y (é) = A° + o(1/a) as a — œ. The "theorem 
now follows from (36) and simple (if tedious) algebra. 

Recall that T'\(¢) may be written in the form (9), if yk is absolutely continu- 
ous, and that I'\(¢) = 0 for all ¢ of the form (7), if yxk is absolutely continuous 
and Vk b = 0 }(YxY — 2(1 + 2c?)Ve y, ae. Analogous results hold for T,. O 


COROLLARY 3. Suppose that yx is absolutely continuous on Q°, that (VK Y is 


locally square integrable, and that yx b, = o` (Vx + vig, where g is absolutely 
continuous. Then there is a function T, on 2° for which 


(37) PE) = [Ta(w)€(w) de 
for all densities & of the form (7). 


Proor. If yx is absolutely continuous on 2°, then so is x and K’ = 2vK (VK Y; 
and, if ¥«b, =o \y«)’ + veg, where g is absolutely continuous, then the 


middle line in (83) may be written 


[Tae — 2c3)o- Want’ — beo ?(kE y — co 'ngt’| dw 
(38) 
~~ {4(00— 208)(0-Wan)’ + ele(a*)'] ~ eae) Edo, 


The corollary follows directly. 0 


The expression for T, is complicated and not especially enlightening. However, 
the following two special cases are of interest. 


COROLLARY 4. Suppose that yk is absolutely continuous and that (yK Y is 
locally square integrable. If ¥«b, =0— (Ve) — 411 + 2c? Vk ws and xb, = 
gle + 3c*)ep, + ale — Tya — heo "ky, + eo *(vky?, then T(E) 
T,(€) = 0 for all ¢ of the form (7). 


COROLLARY 5. Suppose that « is continuously differentiable and that x’ is 
absolutely continuous. If b, = b = 0, then 


Ty = £(8e¢ + 5c?)ky, — 4(15¢ + 17c? + 4c°) Ky? 


+4(3e + 2c8)a7 kp, — 407K”. 


Proors. Corollary 3 follows by setting g = — 4(1 + 2c”), in Corollary 2. 
Corollary 4 follows from an integration by parts which is similar to that 
described in (38). 

Of course, Corollary 5 gives a very weak asymptotic expansion for the distribu- 
tion function of yt (6, — 0)/6, as a > oo. Unfortunately, the condition that x’ be 
absolutely continuous is violated in many examples, including Examples 1 and 2. 

The corrections described in Corollary 4 are applicable in Examples 1 and 2,
after some smoothing. They are not applicable in Example 3, since $(\sqrt{\kappa})'$ is not
square integrable near zero in the example. □


10. Concluding remarks. The condition that t, be stopping times w.r.t. 
2, n 2 1, may be relaxed. It is only necessary that P,{t, > n} be independent of 
w for all n and a. 

Ghosh et al. (1982) consider asymptotic expansions for posterior distributions
when the prior density $\xi$ decays exponentially at the endpoints of its support, as
well as $\xi$ of the form (7). It seems likely that Theorems 1 and 2 hold for such $\xi$
too.

There is no statement of uniformity w.r.t. $\xi$ in Theorems 1 and 2; but it would
be desirable to have one. In particular, when $\bar\gamma_a(\xi)$ is thought of as an average of
$\gamma_a$ near a given point $\omega$, it would be desirable to allow the support of $\xi$ to shrink
to $\omega$ as $a \to \infty$.

If F, is the normal distribution with mean ĝ = œw and unit variance for 
—0 <o < %, then y” =1 and ¢,=0 for all j2 3. So, the function b of 




Corollary 2 is simply b = x’/2x and the confidence interval described there may 
be written s, = [6 — c/ Vt, œ), where 6, = 6, =- b(ĝ)/t in this case. Since b 
does not depend on c, the sequence b,, n = 1, may be chosen independently of ¢ 
(if at all), and the term b (ê, )/t may "be regarded as a correction for bias. In the 
context of Example 2, it agrees with the bias correction suggested by Siegmund 
(1978). 

In the normal case with $\theta = \omega > 0$ and $t_a = \inf\{n \ge 1: S_n > a\}$ for $a \ge 1$, the
very weak expansions of Theorem 1 agree with stronger (fixed $\theta$) expansions for
an analogous problem with Brownian motion. To see how, let $B(s)$, $0 \le s < \infty$,
denote Brownian motion with drift $\mu > 0$ and unit variance, both per unit time;
let $\tau = \tau_a = \inf\{s > 0: B(s) > a\}$ for $a \ge 1$; and let $B_a^* = [B(\tau_a) - \mu\tau_a]/\sqrt{\tau_a}$
for $a \ge 1$. If $-\infty < z < \infty$ and $a$ is sufficiently large, then it is easily seen that
$B_a^* < z$ iff $\tau_a > m$, where $m$ solves the equation $a - \mu m = z\sqrt{m}$. Using a
well-known formula for the distribution function of $\tau_a$ [for example, Siegmund
(1984), Equation (3.15)] and some simple analysis,

$$P_\mu\{B_a^* < z\} = P_\mu\{\tau_a > m\} = \Phi\!\left(\frac{a - m\mu}{\sqrt{m}}\right) - e^{2\mu a}\,(1 - \Phi)\!\left(\frac{a + m\mu}{\sqrt{m}}\right)$$

$$= \Phi(z) - \frac{1}{2\sqrt{a\mu}}\,\phi(z) + o\!\left(\frac{1}{\sqrt{a}}\right),$$

where $\phi = \Phi'$ and $o(1/\sqrt{a})$ is uniform on compacts in $-\infty < z < \infty$. The same relation was
obtained very weakly for the discrete problem in Theorem 1.
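
The Brownian-motion comparison can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes the standard first-passage (inverse Gaussian) formula for a Brownian motion with drift `mu` and unit variance, and it compares that value with a two-term approximation whose correction term, `pdf(z) / (2 * sqrt(a * mu))`, was obtained here from Mills' ratio. All function names are hypothetical.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function, via erfc for tail accuracy."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def norm_sf(x):
    """Upper tail 1 - Phi(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def solve_m(a, mu, z):
    """Positive root m of a - mu*m = z*sqrt(m), solved as a quadratic in sqrt(m)."""
    root = (-z + math.sqrt(z * z + 4.0 * mu * a)) / (2.0 * mu)
    return root * root

def first_passage_prob(a, mu, z):
    """P{tau_a > m} for the first time a drifting Brownian motion exceeds a."""
    m = solve_m(a, mu, z)
    s = math.sqrt(m)
    return (norm_cdf((a - mu * m) / s)
            - math.exp(2.0 * mu * a) * norm_sf((a + mu * m) / s))

def two_term_approx(a, mu, z):
    """Assumed two-term approximation: Phi(z) - pdf(z) / (2 sqrt(a mu))."""
    return norm_cdf(z) - norm_pdf(z) / (2.0 * math.sqrt(a * mu))

if __name__ == "__main__":
    # Keep a moderate: exp(2*mu*a) overflows in double precision for large a.
    for a in (25.0, 100.0, 225.0):
        for z in (-1.0, 0.0, 1.0):
            exact = first_passage_prob(a, 1.0, z)
            approx = two_term_approx(a, 1.0, z)
            print(a, z, exact, approx, (exact - approx) * math.sqrt(a))
```

The last printed column is the error scaled by $\sqrt{a}$; it should shrink as $a$ grows if the remainder is indeed $o(1/\sqrt{a})$.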


Acknowledgments. It is a pleasure to acknowledge enlightening conversa- 
tions with Bob Berk, Bob Keener, David Siegmund, who suggested the final 
remark in Section 10, and Jim Berger. Thanks to Michael Perlman for the 
reference to Stein (1981). 


REFERENCES 


ANDERSON, T. W. (1960). A modification of the sequential probability ratio test to reduce sample
size. Ann. Math. Statist. 31 165-197.

ANSCOMBE, F. J. (1952). Large sample theory of sequential estimation. Proc. Cambridge Philos. Soc.
48 600-607.

ANSCOMBE, F. J. (1953). Sequential estimation. J. Roy. Statist. Soc. Ser. B 15 1-21.

BHATTACHARYA, R. N. and GHOSH, J. K. (1978). On the validity of the formal Edgeworth expansion.
Ann. Statist. 6 434-451.

GHOSH, J. K., SINHA, B. and JOSHI, S. (1982). Expansions for posterior probability and integrated
Bayes risk. In Statistical Decision Theory and Related Topics III (S. Gupta and J. Berger,
eds.) 1 403-456. Academic, New York.

HALL, P. (1983). Inverting an Edgeworth expansion. Ann. Statist. 11 569-576.

JOHNSON, R. (1967). An asymptotic expansion for posterior distributions. Ann. Math. Statist. 38
1899-1906.

JOHNSON, R. (1970). Asymptotic expansions associated with posterior distributions. Ann. Math.
Statist. 41 851-864.

KEENER, R. (1984). Asymptotic expansions in non-linear renewal theory. Technical Report,
Statistics Dept., Univ. of Michigan.

LANDERS, D. and ROGGE, L. (1976). The exact approximation order in the central limit theorem for
random summation. Z. Wahrsch. verw. Gebiete 36 269-283.

LEHMANN, E. (1959). Testing Statistical Hypotheses. Wiley, New York.

SCHWARZ, G. (1962). Asymptotic shapes for Bayes sequential testing regions. Ann. Math. Statist. 33
224-236.

SIEGMUND, D. (1978). Estimation following sequential testing. Biometrika 65 341-349.

SIEGMUND, D. (1980). Sequential chi-squared and F tests and related confidence intervals.
Biometrika 67 389-402.

SIEGMUND, D. (1985). Sequential Analysis: Tests and Confidence Intervals. Springer, New York.

STEIN, C. (1981). On the coverage probability of a confidence set based on a prior distribution.
Technical Report, Statistics Dept., Stanford Univ.

TAKAHASHI, H. (1985). Asymptotic expansions in Anscombe’s theorem for repeated significance tests
and estimation after sequential testing. Ann. Statist. To appear.

WOODROOFE, M. (1977). Second order approximations for sequential point and interval estimation.
Ann. Statist. 5 984-996.

WOODROOFE, M. (1982). Nonlinear Renewal Theory in Sequential Analysis. SIAM, Philadelphia.

WOODROOFE, M. (1985). Asymptotic local minimaxity in sequential point estimation. Ann. Statist.
13 676-688.

WOODROOFE, M. and KEENER, R. (1985). Asymptotic expansions in boundary crossing problems.
Ann. Probab. To appear.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF MICHIGAN 
ANN ARBOR, MICHIGAN 48109 


The Annals of Statistics
1986, Vol. 14, No. 3, 1068-1079


ASYMPTOTIC PROPERTIES OF NEYMAN-PEARSON TESTS 
FOR INFINITE KULLBACK-LEIBLER INFORMATION 


By ARNOLD JANSSEN
Universität Siegen


In the present paper we will improve the results concerning the rate of
convergence of the error of second kind of the Neyman–Pearson test if the
Kullback–Leibler information $K(P_0, P_1)$ is infinite. It is pointed out that in
certain cases the sequence $\exp(-q_{\alpha,n})$ is the correct rate of convergence if
$-q_{\alpha,n}$ denotes the logarithm of the critical value of the Neyman–Pearson
test of level $\alpha$ and sample size $n$. Therefore we generalize the classical results
of Stein, Chernoff, and Rao which deal with the error probability of second
kind and state that $q_{\alpha,n} \sim nK(P_0, P_1)$ if the Kullback–Leibler information is
finite. Moreover the relation between $q_{\alpha,n}$ and the local behavior of the
Laplace transform of the log-likelihood distribution with respect to the
hypothesis is studied. The results can be applied to one-sided test problems
for exponential families if the hypothesis consists of a single point. In this
case it may happen that $q_{\alpha,n}$ is of the order $n^{1/p}$ for some $p$, $0 < p < 1$.


1. Introduction. Let $E^n = (\Omega^n, \mathcal{A}^n, (P_0^n, P_1^n))$ be an n-fold binary
experiment and suppose that $P_1^n(A) = \int_A (dP_1^n/dP_0^n)\,dP_0^n + P_1^n(A \cap N_n)$, $A \in \mathcal{A}^n$, is
the Lebesgue decomposition of the nth product measure $P_1^n$ with respect to $P_0^n$
for some $N_n \in \mathcal{A}^n$, $n \in \mathbb{N}$. Suppose that the $P_0^n$ density $dP_1^n/dP_0^n$ of the
absolutely continuous part of $P_1^n$ is defined to be equal to $\infty$ on $N_n$. Then we
consider the Neyman–Pearson test $\varphi_{\alpha,n}$ of level $\alpha \in (0,1)$ for the test problem
$H = \{P_0^n\}$ against $K = \{P_1^n\}$,

$$\varphi_{\alpha,n} = \begin{cases} 1 & \text{if } dP_1^n/dP_0^n > c_{\alpha,n}, \\ \gamma_{\alpha,n} & \text{if } dP_1^n/dP_0^n = c_{\alpha,n}, \\ 0 & \text{if } dP_1^n/dP_0^n < c_{\alpha,n}, \end{cases}$$

satisfying $E_{P_0^n}\varphi_{\alpha,n} = \alpha$. It is well known that the error probabilities of second
kind $E_{P_1^n}(1 - \varphi_{\alpha,n})$ and the critical values $c_{\alpha,n}$ satisfy

(1.1)  $\lim_{n\to\infty} \left[E_{P_1^n}(1 - \varphi_{\alpha,n})\right]^{1/n} = \lim_{n\to\infty} c_{\alpha,n}^{1/n} = \exp(-K(P_0, P_1))$

if $K(P_0, P_1) = \int \log(dP_0/dP_1)\,dP_0$ denotes the Kullback–Leibler information
(compare with Chernoff (1956), who referred to Charles Stein; Rao (1962); Krafft
and Plachky (1970)). Suppose that log is continuously extended on $[0, \infty]$. Note
that (1.1) contains only little information if $K(P_0, P_1)$ is infinite. Therefore we
are interested in the correct rate of convergence in the general case. Observe that
for example the case $K(P_0, P_1) = \infty$ appears in connection with one-sided tests
in exponential families and the local structure of exponential families; cf. Janssen

Received December 1984; revised October 1985. 

AMS 1980 subject classifications. Primary 62F05; secondary 62F03. 

Key words and phrases Kullback—Leibler information, rate of convergence of the error probabil- 
ity of second kind, Neyman- Pearson tests. 




(1986). To fix the idea of this paper let us first suppose that $K(P_0, P_1)$ is finite. If

(1.2)  $q_{\alpha,n} = -\log c_{\alpha,n}$

then (1.1) implies

(1.3)  $\lim_{n\to\infty} \left[E_{P_1^n}(1 - \varphi_{\alpha,n})\right]^{1/q_{\alpha,n}} = \exp(-1).$

Since (1.3) is independent of $K(P_0, P_1)$ we ask whether (1.3) holds in a more
general situation. It turns out that this result can be proved for a large class of
interesting experiments having infinite Kullback–Leibler information. Thus
$1/q_{\alpha,n}$ is the correct speed of convergence. Note that

(1.4)  $n/q_{\alpha,n} \to 0$ iff $K(P_0, P_1) = \infty.$
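
Relations (1.1) to (1.3) can be illustrated in a case where everything is available in closed form. The sketch below is an illustration under assumptions that are not in the paper: it takes $P_0 = N(0,1)$ and $P_1 = N(\mu,1)$, so that $K(P_0,P_1) = \mu^2/2$ is finite, the test rejects when $\sum x_i > z_{1-\alpha}\sqrt{n}$, and no randomization is needed. The function name `normal_shift_rates` is hypothetical.

```python
import math
from statistics import NormalDist

def normal_shift_rates(n, mu=1.0, alpha=0.05):
    """Return (q, beta) for N(0,1)^n against N(mu,1)^n at level alpha.

    The log-likelihood ratio is mu*sum(x) - n*mu^2/2, and the level-alpha
    test rejects when sum(x) > z*sqrt(n), so q = -log(critical value).
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    q = n * mu * mu / 2.0 - mu * z * math.sqrt(n)
    # Error of second kind: Phi(z - mu*sqrt(n)), computed via erfc so that
    # the far tail does not cancel to zero.
    beta = 0.5 * math.erfc((mu * math.sqrt(n) - z) / math.sqrt(2.0))
    return q, beta

if __name__ == "__main__":
    for n in (100, 400, 900):
        q, beta = normal_shift_rates(n)
        print(n,
              q / n,                # tends to K = mu^2 / 2
              beta ** (1.0 / n),    # tends to exp(-K), as in (1.1)
              beta ** (1.0 / q))    # tends to exp(-1), as in (1.3)
```

The sample sizes are kept below roughly 1500 because the error of second kind underflows in double precision beyond that; this is a limitation of the sketch, not of the result.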


On the other hand it is pointed out that there is some connection between $q_{\alpha,n}$
and the Laplace transform

(1.5)  $\omega(t) = \int \exp(t \log dP_1/dP_0)\,dP_0$

of $\mathcal{L}(\log dP_1/dP_0 \mid P_0)$, which exists for $t \in [0,1]$. Note that

(1.6)  $d^{+}\log\omega(t)/dt\,|_{t=0} = -K(P_0, P_1)$

and

(1.7)  $n \log\omega(1/q_{\alpha,n}) \to -1$ for $n \to \infty$ if $K(P_0, P_1) < \infty.$


In order to prove the results we apply a convergence theorem for Laplace 
transforms. Only partial results are obtained by theorems for large deviations. 


2. Preliminaries. In connection with the tests $\varphi_{\alpha,n}$ the behavior of the
log-likelihood distributions of product experiments is used. Therefore we first
recall some facts for binary experiments $E = (\Omega, \mathcal{A}, (P_0, P_1))$. Let $\nu_0$ be the
log-likelihood distribution

(2.1)  $\nu_0 = \mathcal{L}(\log dP_1/dP_0 \mid P_0),$

where $\mathcal{L}(Y \mid Q)$ denotes the distribution of a random variable $Y$ with respect to $Q$.
If $\mu$ is a finite measure on $[-\infty, \infty)$ then

(2.2)  $\omega_\mu(t) = \int \exp(ty)\,d\mu(y)$

defines the Laplace transform of $\mu$ where exp is continuously extended on
$[-\infty, \infty]$. Let $Y$ be a random variable. Then $\omega_Y$ denotes the Laplace transform
of its distribution. Returning to $E$ let us remark that $\omega_{\nu_0}$ is bounded by 1 on
$[0, 1]$. Moreover

(2.3)  $\nu_0(\{-\infty\}) = \lim_{t \to 0+} (1 - \omega_{\nu_0}(t))$

and

(2.4)  $\mathcal{L}(\log dP_1/dP_0 \mid P_1)(A) = \int_A \exp(x)\,d\nu_0(x) + (1 - \omega_{\nu_0}(1))\,\varepsilon_\infty(A)$

for each Borel subset $A$ of $(-\infty, \infty]$. In addition

(2.5)  $K(P_0, P_1) = -\int x\,d\nu_0(x)$ and $0 \le K(P_0, P_1) \le \infty.$


Suppose that $E = \{P_0, P_1\}$ and $F = \{Q_0, Q_1\}$ are two binary experiments. Then

(2.6)  $\mathcal{L}(\log d(P_1 \otimes Q_1)/d(P_0 \otimes Q_0) \mid P_0 \otimes Q_0) = \mathcal{L}(\log dP_1/dP_0 \mid P_0) * \mathcal{L}(\log dQ_1/dQ_0 \mid Q_0),$

where $*$ denotes the convolution of measures on the topological semigroup
$([-\infty, \infty), +)$ equipped with the usual topology; cf. Janssen (1985c), (9.4) and
(9.21). Let $P^n$ denote the n-fold product measure of $P$, $\varepsilon_a$ the Dirac measure
defined by $a$. Let $E_P$ be the expectation with respect to $P$ and suppose that $\nu^{*n}$
denotes the nth convolution product of $\nu$. Two positive functions $f, g$ are
equivalent for $x \to 0$ or $\pm\infty$, $g(x) \sim f(x)$, if the ratio tends to 1.

In the sequel a continuity theorem for Laplace transforms is needed.


LEMMA 2.1. Suppose that $\mu_n$, $n \ge 0$, is a sequence of probability measures
on $[-\infty, \infty)$ such that $\sup_{n \ge 0} \omega_{\mu_n}(1) < K$ for some $K > 0$. Then the following
statements are equivalent:

(2.7)  $\mu_n \to \mu_0$ weakly,

(2.8)  $\omega_{\mu_n}(y) \to \omega_{\mu_0}(y)$ for all $y \in (0,1)$.


PROOF. Put p, = Bp*&-1gx and define $\mu_n' = \mathcal{L}(\exp(\cdot) \mid \mu_n)$. Then
$\int x\,d\mu_n'(x) < 1$ follows and $\omega_{\mu_n}(y) = \int x^{y}\,d\mu_n'(x)$ is the Mellin transform of $\mu_n'$. The
continuity theorem for Mellin transforms shows that (2.8) is equivalent to the
weak convergence of $\mu_n'$ to $\mu_0'$; cf. Strasser (1985), Theorem (5.16). Thus the
lemma is proved. □


Note that Lemma 2.1 does not hold in general if $\omega_{\mu_n}(1)$ is unbounded.


3. Main results. First of all we study the connection between the Laplace
transform of $\nu_0$ (2.1) and the logarithm $q_{\alpha,n} = -\log c_{\alpha,n}$ of the critical value $c_{\alpha,n}$
in order to generalize (1.6). Therefore we always assume that

(3.1)  $P_0 \ne P_1$ and $P_0 \ll P_1.$

If the second condition is not fulfilled then $c_{\alpha,n} = 0$ finally holds. Note that (3.1)
implies that $\nu_0 \ne \varepsilon_0$ and $\nu_0$ is concentrated on $\mathbb{R}$. Put

(3.2)  $\phi = \log\omega_{\nu_0}.$

Then $\phi$ is a convex function such that $\phi(t) < 0$ on $(0,1)$ and $\lim_{t \to 0+} \phi(t) = 0$.
Thus for $n > (1 - \log\alpha)\,(\sup_{0 < t \le 1} |\phi(t)|)^{-1} = \gamma$ there exists an increasing
sequence of reals $(k_{\alpha,n})_n$ such that

(3.3)  $\phi(1/k_{\alpha,n}) = -(1 - \log\alpha)/n.$




LEMMA 3.1.
(a) $k_{\alpha,n} \le q_{\alpha,n}$ for $n > \gamma$;
(b) $\log\alpha - 1 \le \liminf_{n\to\infty} n\phi(1/q_{\alpha,n}) \le \limsup_{n\to\infty} n\phi(1/q_{\alpha,n}) < 0.$


PROOF. (a) Let $Y_1, \ldots, Y_n: (\Omega, \mathcal{A}, P) \to \mathbb{R}$ be independent random variables
with common distribution $\nu_0$. For $u \in \mathbb{R}$ we apply the following well-known
inequality of Chernoff (1952):

$$P(\{Y_1 + \cdots + Y_n \ge nu\}) \le P(\{\exp(t(Y_1 + \cdots + Y_n - nu)) \ge 1\}) \le (\exp(-tu)\,\omega_{\nu_0}(t))^n \quad \text{for } t > 0.$$

Hence

$$\log\alpha \le \log P(\{Y_1 + \cdots + Y_n \ge -q_{\alpha,n}\}) \le n(\phi(t) + t q_{\alpha,n}/n).$$

Inserting $t = 1/k_{\alpha,n}$, the result follows.
In order to prove (b) we first remark that

$$\lim_{n\to\infty} n\phi(1/k_{\alpha,n}) \le \liminf_{n\to\infty} n\phi(1/q_{\alpha,n})$$

by (a) since $\phi$ is a convex function. Now assume that $\lim_{k\to\infty} n_k\phi(t/q_{\alpha,n_k}) = 0$
for $t = 1$ and a subsequence $n_k$. Then the limit is equal to 0 for all $t \in (0,1)$. Let
$(Y_n)_{n \in \mathbb{N}}$ be a sequence of independent copies of random variables having the
distribution $\nu_0$. Then Lemma 2.1 implies that

(3.4)  $q_{\alpha,n_k}^{-1} \sum_{j=1}^{n_k} Y_j$

tends in distribution to 0. This yields the desired contradiction since

(3.5)  $\alpha \ge P\left(\left\{\sum_{j=1}^{n_k} Y_j > -q_{\alpha,n_k}\right\}\right).$ □
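
Lemma 3.1 can be checked in a case where both $k_{\alpha,n}$ and $q_{\alpha,n}$ have closed forms. The following is a sketch under assumptions not made in the paper: $P_0 = N(0,1)$ and $P_1 = N(\mu,1)$, so that the log-likelihood distribution is $N(-\mu^2/2, \mu^2)$ and its log Laplace transform is $\mu^2 t(t-1)/2$. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def log_laplace(t, mu):
    """log of the Laplace transform of N(-mu^2/2, mu^2), the law of
    log dP1/dP0 under P0 in the normal shift model."""
    return 0.5 * mu * mu * t * (t - 1.0)

def k_alpha_n(n, mu, alpha):
    """Solve log_laplace(1/k) = -(1 - log(alpha))/n as in (3.3).

    The smaller root t = 1/k is taken so that k increases with n.
    A solution exists only for n > 8 (1 - log alpha) / mu^2.
    """
    rhs = 1.0 - math.log(alpha)
    disc = 1.0 - 8.0 * rhs / (n * mu * mu)
    if disc < 0.0:
        raise ValueError("n is too small for (3.3) to have a solution")
    t = 0.5 * (1.0 - math.sqrt(disc))
    return 1.0 / t

def q_alpha_n(n, mu, alpha):
    """q = -log(critical value) of the level-alpha Neyman-Pearson test."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return n * mu * mu / 2.0 - mu * z * math.sqrt(n)

if __name__ == "__main__":
    mu, alpha = 1.0, 0.05
    for n in (40, 100, 1000, 10000):
        k = k_alpha_n(n, mu, alpha)
        q = q_alpha_n(n, mu, alpha)
        # Part (a): k <= q.  Part (b): log(alpha) - 1 <= n*phi(1/q) < 0.
        print(n, k, q, n * log_laplace(1.0 / q, mu))
```

In this finite-information example the last column also drifts toward $-1$, in line with (1.7).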








It should be remarked that (a) also follows from statement (14) of Krafft and
Plachky (1970). The next theorem is well known for experiments having finite
Kullback–Leibler information $K(P_0, P_1)$ without further restrictions concerning $F$.


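THEOREM 3.1. Let $F$ be the distribution function of $\nu_0 = \mathcal{L}(\log dP_1/dP_0 \mid P_0)$.
If $F$ is positive and satisfies

(3.6)  $\limsup_{x \to -\infty} F(\lambda_0 x)/F(x) < 1$ for some $\lambda_0 > 1$

then

(3.7)  $\limsup_{n\to\infty} q_{\alpha,n}/k_{\alpha,n} < \infty,$

(3.8)  $\left[E_{P_1^n}(1 - \varphi_{\alpha,n})\right]^{1/q_{\alpha,n}} \to \exp(-1).$

(3.9)  If $0 < \alpha_1 < \alpha_2 < 1$ then $\liminf_{n\to\infty} q_{\alpha_1,n}/q_{\alpha_2,n} > 0.$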




PROOF. Note that (3.7) and (3.9) follow from (1.6), (1.7), and (1.1) in the case
$K(P_0, P_1) < \infty$. Therefore we may assume that $K(P_0, P_1) = \infty$. Then by (2.5)

(3.10)  $\int x\,d\nu_0(x) = -\infty.$

Let $a(\lambda)$ denote $\limsup_{x \to -\infty} F(\lambda x)/F(x)$. Then $a(\lambda_0^n) \le a(\lambda_0)^n$ and

(3.11)  $\lim_{\lambda\to\infty} a(\lambda) = 0$

since $F$ is increasing. First of all we show that (3.6) implies

(3.12)  $\lim_{s \to 0+} \limsup_{t \to 0+} \phi(st)/\phi(t) = 0.$

Therefore we put $\omega = \omega_{\nu_0}$ and fix $1 > s > 0$. Since $\phi(t) = \log\omega(t) = \omega(t) - 1 +
o(\omega(t) - 1)$, we remark

(3.13)  $\limsup_{t \to 0+} \frac{\phi(st)}{\phi(t)} = \limsup_{t \to 0+} \frac{\omega(st) - 1}{\omega(t) - 1}.$

Integration by parts (Feller (1971), page 150) yields

(3.14)  $1 - \omega(t) = t\left[\int_{-\infty}^{0} F(x)\exp(tx)\,dx - g(t)\right],$

where

$$g(t) = \int_{0}^{\infty} (1 - F(x))\exp(tx)\,dx$$

is continuous for $t \to 0+$, $g(0) < \infty$. Thus

(3.15)  $\frac{1 - \omega(st)}{1 - \omega(t)} = \frac{\int_{-\infty}^{0} F(x/s)\exp(tx)\,dx - s\,g(st)}{\int_{-\infty}^{0} F(x)\exp(tx)\,dx - g(t)}.$

Note that the denominator of the right-hand side tends to infinity for $t \to 0+$
because of (3.10); see Feller (1971), page 151. Therefore it is sufficient to prove

(3.16)  $\limsup_{t \to 0+} \frac{\int_{-\infty}^{0} F(x/s)\exp(tx)\,dx}{\int_{-\infty}^{0} F(x)\exp(tx)\,dx} \le a(1/s).$

Choose $x_0 < 0$ such that $F(x/s) \le (a(1/s) + \varepsilon)F(x)$ for $x \le x_0$. Then

$$\int_{-\infty}^{0} F(x/s)\exp(tx)\,dx \le |x_0| + (a(1/s) + \varepsilon)\int_{-\infty}^{0} F(x)\exp(tx)\,dx.$$

Hence (3.16) follows since the denominator tends to infinity for $t \to 0+$. In the
sequel we omit the index $\alpha$ if possible and write $c_n$, $\varphi_n$, $q_n$, $k_n$. Now it will be
pointed out that (3.12) implies (3.7).

Suppose that there exists a subsequence such that $\lim_{k\to\infty} q_{n_k}/k_{n_k} = \infty$. Then

(3.17)  $n_k\phi(t/q_{n_k}) = (-1 + \log\alpha)\,\frac{\phi\big((1/k_{n_k})(t k_{n_k}/q_{n_k})\big)}{\phi(1/k_{n_k})} \to 0$



if we take (3.3) into account and observe that $\phi$ is decreasing in a neighbourhood
of 0.

In view of (3.4) and (3.5) statement (3.17) is impossible. In order to prove (3.8)
we remark that

(3.18)  $\int (1 - \varphi_n)\,dP_1^n \le \int_{(-\infty, -q_n]} \exp(x)\,d\nu_0^{*n}(x) \le \exp(-q_n)$

if we observe (2.6), (2.4), and note that the absolutely continuous part of
$\mathcal{L}(\log dP_1^n/dP_0^n \mid P_1^n)$ has the $\nu_0^{*n}$ density $\exp(x)$. Next we prove the converse
inequality. Let $Y_i: (\Omega, \mathcal{A}, P) \to \mathbb{R}$ be a sequence of independent random variables
having common distribution $\nu_0$. Then for $d > 1$ and $W_n = \sum_{i=1}^{n} Y_i$

(3.19)  $\int (1 - \varphi_n)\,dP_1^n \ge \int_{\{-q_n > W_n \ge -d q_n\}} \exp(W_n)\,dP + (1 - \gamma_n)\exp(-q_n)\,P(\{W_n = -q_n\}).$

Suppose that $n_k$ is a subsequence such that

(3.20)  $\left[\int (1 - \varphi_{n_k})\,dP_1^{n_k}\right]^{1/q_{n_k}} \to a$

is convergent. Then we prove that either

(3.21)  $a = \exp(-1)$ or $\liminf_{k\to\infty} P(\{q_{n_k}^{-1} W_{n_k} \in (-d, -1)\}) = \beta > 0.$


Therefore we introduce the distributions $\mu_n = \mathcal{L}(q_n^{-1} W_n \mid P)$. Since $\log\omega_{\mu_n}(1) =
n\phi(q_n^{-1})$ is uniformly bounded (compare with Lemma 3.1) we note that $(\mu_n)_n$ is a
tight sequence of probability measures on $[-\infty, \infty)$. Therefore each subsequence
has a weak accumulation point in the set of probability measures on $[-\infty, \infty)$.
In order to prove (3.21) let us assume that

(3.22)  $P(\{q_{n_k}^{-1} W_{n_k} \in (-d, -1)\}) \to 0$

for some subsequence of $n_k$. Passing to a further subsequence we may assume
that

(3.23)  $\mu_{n_k} \to Q$

tends weakly to some probability measure $Q$ on $[-\infty, \infty)$.

Let us first assume that $Q$ is a Dirac measure. Since $-1$ is a $(1 - \alpha)$th quantile
of $\mu_n$ we conclude $Q = \varepsilon_{-1}$. Together with (3.22) we note that $\mu_{n_k}((-\infty, -1)) \to 0$.
Thus $(1 - \gamma_{n_k})P(\{W_{n_k} = -q_{n_k}\}) \to 1 - \alpha$ since $(1 - \alpha) = \mu_n((-\infty, -1)) +
(1 - \gamma_n)\mu_n(\{-1\})$. Hence

(3.24)  $\left[(1 - \gamma_{n_k})\exp(-q_{n_k})\,P(\{W_{n_k} = -q_{n_k}\})\right]^{1/q_{n_k}} \to \exp(-1)$

and (3.19) proves (3.21).
Suppose in the second case that $Q$ is not a Dirac measure. Then we first show
that $Q$ is concentrated on $\mathbb{R}$. Therefore we apply Lemma 2.1 which proves

$$\log\omega_Q(t) = \lim_{k\to\infty} n_k\phi(t/q_{n_k}) \quad \text{for } t \in (0,1).$$




Using the technique of (3.17) we conclude that

(3.25)  $\lim_{t \to 0+} \log\omega_Q(t) = 0$

by (3.12). The assertion follows from (2.3).
Next we shall prove that

(3.26)  $(-d, -1)$ is contained in the support of $Q$.

Since $Q$ is the limit distribution of an infinitesimal triangular (or null) array it is
infinitely divisible. If $Q$ has a normal factor then the claim is obvious. Otherwise
the Lévy measure $\eta$ of $Q$ is nontrivial and

(3.27)  $n_k F(q_{n_k} x) \to \eta((-\infty, x]), \quad x < 0,$

(3.28)  $n_k (1 - F(q_{n_k} x)) \to \eta([x, \infty)), \quad x > 0,$

for all continuity points $x$ of the distribution function of $\eta$; see Feller (1971),
page 585. Let us prove that $\eta$ vanishes on $(0, \infty)$. Therefore consider
$\int_0^\infty \exp(x)\,d\nu_0(x) < \infty$ which implies $n(1 - F(nx)) \to 0$ for all $x > 0$. By (1.4) the
inequality $1 - F(q_n x) \le 1 - F(nx)$ follows for sufficiently large $n$. Moreover let
us show that

(3.29)  $\eta((-\varepsilon, 0)) > 0$ for each $\varepsilon > 0$.

Since $Q$ is not degenerate there exists a continuity point $x_0 \in (-\varepsilon, 0)$ such that
$\eta((-\infty, x_0]) > 0$. Choose $\lambda > \lambda_0$ such that $\lambda^{-1} x_0$ is a continuity point of
$x \mapsto \eta((-\infty, x])$. Then by (3.6) and (3.27)

(3.30)  $1 > \lim_{k\to\infty} \frac{F(q_{n_k} x_0)}{F(q_{n_k} \lambda^{-1} x_0)} = \frac{\eta((-\infty, x_0])}{\eta((-\infty, \lambda^{-1} x_0])}.$

It is well known that the support of $Q$ is equal to $\mathbb{R}$ or $(-\infty, c]$ for some $c \in \mathbb{R}$
in view of (3.29); compare with Tucker (1975). Note that the case $[c, \infty)$ can be
excluded since $Q$ contains a factor which is a compound Poisson distribution
having the support $(-\infty, 0]$. Since $-1$ is a $(1 - \alpha)$th quantile of $\mu_n$ the inequality
$c \ge -1$ holds and (3.26) follows. The Portmanteau theorem yields

(3.31)  $\liminf_{k\to\infty} P(\{q_{n_k}^{-1} W_{n_k} \in (-d, -1)\}) \ge Q((-d, -1))$

and (3.21) is proved. Returning to (3.19) we compute

(3.32)  $\liminf_{k\to\infty} \left[\int (1 - \varphi_{n_k})\,dP_1^{n_k}\right]^{1/q_{n_k}} \ge \lim_{k\to\infty} \left[\exp(-d q_{n_k})\,P(\{q_{n_k}^{-1} W_{n_k} \in (-d, -1)\})\right]^{1/q_{n_k}} = \exp(-d).$

If $d$ tends to 1 then the result follows since all accumulation points $a$ of (3.20)
are equal to $\exp(-1)$.


The proof of (3.9): Suppose that $\liminf q_{\alpha_1,n}/q_{\alpha_2,n} = 0$. Passing to
subsequences there are probability measures $Q_1, Q_2$ on $\mathbb{R}$ such that
$\mu_{\alpha_j,n_k} = \mathcal{L}(q_{\alpha_j,n_k}^{-1} W_{n_k} \mid P) \to Q_j$ weakly for $j = 1, 2$ and

(3.33)  $\lim_{k\to\infty} q_{\alpha_1,n_k}/q_{\alpha_2,n_k} = 0.$

Note that $-1$ is a $(1 - \alpha_2)$th quantile of $\mu_{\alpha_2,n_k}$. But (3.33) implies that
$(q_{\alpha_1,n_k}/q_{\alpha_2,n_k})(q_{\alpha_1,n_k}^{-1} W_{n_k})$ tends to 0 in distribution, which is impossible. □


EXAMPLE. Suppose that the distribution function $F$ satisfies the condition

(3.34)  $F(x) \sim |x|^{-p} L(x) B(x)$ for $x \to -\infty$,

where the index $p$ is positive, $B$ is bounded away from zero and infinity:

(3.35)  $0 < m \le B(x) \le M$ for $x \le x_0$.

Let $L$ be a positive function varying slowly at $-\infty$, i.e., $L(x\lambda)/L(x) \to 1$ if
$x \to -\infty$ for all $\lambda > 0$. Then (3.6) is satisfied. The condition (3.34) implies that
$K(P_0, P_1) = \infty$ if $p < 1$ and $K(P_0, P_1) < \infty$ for $p > 1$ (use integration by parts).
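
Condition (3.6) is easy to probe numerically. The sketch below is an illustration only: it uses two hypothetical left tails, a pure power tail $|x|^{-p}$ (the case $L = B = 1$ of (3.34)) and a slowly varying tail $1/\log(e + |x|)$, and evaluates the ratio $F(\lambda x)/F(x)$ far out on the negative axis.

```python
import math

def power_tail(p):
    """Left tail F(x) = min(1, |x|^-p) for x < 0 (hypothetical example)."""
    def F(x):
        return min(1.0, abs(x) ** (-p))
    return F

def slowly_varying_tail(x):
    """Left tail F(x) = 1 / log(e + |x|), slowly varying at -infinity."""
    return 1.0 / math.log(math.e + abs(x))

def tail_ratio(F, lam, x):
    """The ratio F(lam * x) / F(x) that appears in condition (3.6)."""
    return F(lam * x) / F(x)

if __name__ == "__main__":
    lam = 4.0
    for x in (-1e3, -1e6, -1e12):
        # Power tail: ratio equals lam^-p = 0.5 < 1, so (3.6) holds.
        # Slowly varying tail: ratio creeps up to 1, so (3.6) fails.
        print(x,
              tail_ratio(power_tail(0.5), lam, x),
              tail_ratio(slowly_varying_tail, lam, x))
```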


It should be remarked that (3.6) is not satisfied if $F$ is slowly varying at $-\infty$.
In this case the rate of convergence of $q_{\alpha,n}$ can be computed and (3.9) is no longer
true.


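THEOREM 3.2. Suppose that the distribution function $F$ of $\nu_0$ is positive and
slowly varying at $-\infty$. Then

(3.36)  $nF(-q_{\alpha,n}) \to -\log\alpha, \qquad n\phi(q_{\alpha,n}^{-1}) \to \log\alpha,$

(3.37)  $\lim_{n\to\infty} q_{\alpha_1,n}/q_{\alpha_2,n} = 0$ if $0 < \alpha_1 < \alpha_2 < 1.$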


PROOF. We use the notation of the proof of the preceding theorem. Integration
by parts yields

(3.38)  $\omega(t) = \int (1 - F(x))\,t\exp(tx)\,dx$

and

(3.39)  $t^{-1}(1 - \omega(t)) = \int_{-\infty}^{0} F(x)\exp(tx)\,dx - \int_{0}^{\infty} (1 - F(x))\exp(tx)\,dx.$

The second term on the right-hand side is continuous for $t \to 0+$. Moreover
(5.22) of Feller (1971), page 447, can be applied to $F(0-)^{-1} F\,1_{(-\infty,0)}$. Thus

(3.40)  $-\phi(t) = -\log\omega(t) \sim 1 - \omega(t) \sim F(-1/t)$ for $t \to 0+$.

According to Lemma 3.1 let us choose a subsequence $n_k$ such that $n_k\phi(q_{n_k}^{-1}) \to
c < 0$ is convergent. By (3.40)

(3.41)  $\lim_{k\to\infty} \frac{n_k \log\omega(s q_{n_k}^{-1})}{n_k \log\omega(q_{n_k}^{-1})} = \lim_{k\to\infty} \frac{F(-s^{-1} q_{n_k})}{F(-q_{n_k})} = 1$

for each $s \in (0,1)$. Thus

(3.42)  $\lim_{k\to\infty} n_k\phi(s q_{n_k}^{-1}) = c.$

Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of i.i.d. random variables having the distribution $\nu_0$.
Then $n\phi(s q_n^{-1})$ is the logarithm of the Laplace transform of $Y_n = q_n^{-1} \sum_{i=1}^{n} X_i$.
By Lemma 2.1 $Y_{n_k}$ tends in distribution to $Q$ where $\omega_Q(s) = \exp(c)$ for $s \in (0,1)$.
Hence

(3.43)  $Q = (1 - a)\varepsilon_{-\infty} + a\varepsilon_0, \qquad a = \exp(c).$

Moreover the value $-1$ is a $(1 - \alpha)$th quantile of the distribution of $Y_n$.
Therefore $a = \alpha$ and (3.36) is proved if we take (3.40) into account and remark
that $c = \log\alpha$ does not depend on the special choice of the subsequence.

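Note that by Seneta (1976), Theorem (1.1),

(3.44)  $\frac{F(xs)}{F(x)} \to 1$ for $x \to -\infty$

uniformly in $s$ for $s \in [a, b]$, $0 < a < b < \infty$. If we assume that $s_k =
q_{\alpha_1,n_k}/q_{\alpha_2,n_k} \ge \delta > 0$ is bounded away from 0 then (3.44) contradicts

(3.45)  $\frac{F(-q_{\alpha_2,n_k} s_k)}{F(-q_{\alpha_2,n_k})} \to \frac{\log\alpha_1}{\log\alpha_2}.$ □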

4. Applications to one-sided test problems in exponential families. In
contrast to the classical results (1.1) and (1.6), the rate of convergence of the error
of second kind and the critical value depend on the level $\alpha$ if $K(P_0, P_1) = \infty$.
Therefore let us study some examples and applications.

In connection with local behavior of one-sided test problems for a simple
hypothesis the critical value of the test can be estimated if the hypothesis
belongs to the domain of attraction of a stable law; see Janssen (1985b), (1986).
Therefore suppose that $Q$ is a probability measure on $\mathbb{R}$ such that $\omega_Q$ is finite on
some interval $[0, b)$ for $b > 0$. Let $E = (\mathbb{R}, \mathcal{B}, (Q_\vartheta)_{\vartheta \in [0,b)})$ be the exponential
family generated by $Q$ such that

(4.1)  $\frac{dQ_\vartheta}{dQ}(x) = \frac{\exp(\vartheta x)}{\omega_Q(\vartheta)}$ for $0 \le \vartheta < b$.

Let us consider the test problem $H: \{Q_0\}$ against $K: \{Q_\vartheta: \vartheta \in (0, b)\}$ of sample
size $n$.
The most powerful $\alpha$ test is equal to

$$\varphi_{\alpha,n}(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i > -s_{\alpha,n}, \\ \gamma_n & \text{if } \sum_{i=1}^{n} x_i = -s_{\alpha,n}, \\ 0 & \text{if } \sum_{i=1}^{n} x_i < -s_{\alpha,n}, \end{cases}$$

if $E_{Q^n}\varphi_{\alpha,n} = \alpha$. Note that

(4.2)  $\exp(-\vartheta s_{\alpha,n} - n\log\omega_Q(\vartheta)), \quad 0 < \vartheta,$

is a critical value of the level $\alpha$ test for $\{Q^n\}$ against $\{Q_\vartheta^n\}$. Suppose in the sequel
that $Q$ belongs to the domain of attraction of a nondegenerate stable law $P$ with
the index $p > 0$ of stability on $\mathbb{R}$, i.e.,

(4.3)  $\mathcal{L}(T_n \mid Q)^{*n} * \varepsilon_{a_n} \to P$

weakly for some $a_n \in \mathbb{R}$ and a sequence $(\delta_n)_n$ of positive numbers tending to 0.
Define $T_n(x) = \delta_n x$. As pointed out in Janssen (1986)

(4.4)  $\delta_n(-s_{\alpha,n}) + a_n \to u_\alpha$

follows where $u_\alpha$ is the $(1 - \alpha)$th quantile of $P$. If $p > 1$, the first moment of $Q$
exists and $K(Q, Q_\vartheta)$ is finite. The classical results imply

(4.5)  $\lim_{n\to\infty} \left[E_{Q_\vartheta^n}(1 - \varphi_{\alpha,n})\right]^{1/n} = \exp(-K(Q, Q_\vartheta)), \quad \vartheta > 0,$

where $K(Q, Q_\vartheta) = -\vartheta\int x\,dQ(x) + \log\omega_Q(\vartheta)$. Note that (4.5) does not depend on
$\alpha$. If $p < 1$, the first moment of $Q$ no longer exists and hence $K(Q, Q_\vartheta) = \infty$ for
$\vartheta > 0$.

Let us first study the case $p < 1$. Following the arguments used in Janssen
(1985b), (1986) let us remark that $P$ is one-sided stable on $\mathbb{R}$ having a support

(4.6)  $\operatorname{supp} P = (-\infty, a]$ for some $a \in \mathbb{R}$.

The centering procedure of Feller (1971), page 580, implies

(4.7)  $a_n \to a$ since $\mathcal{L}(T_n \mid Q)^{*n} \to P * \varepsilon_{-a}$ weakly.

In order to estimate the rate of convergence we recall some known results
(Janssen (1985b), (1986)):

(4.8)  $\log\omega_P(y) = -c_1 y^p + ay$ for some $c_1 > 0$ and all $y > 0$,

(4.9)  $\log\omega_Q(y) \sim \Gamma(2 - p)(p - 1)^{-1} c_2\, y^p L(1/y), \quad y \to 0+,$

(4.10)  $Q((-\infty, t]) \sim c_2 |t|^{-p} L(|t|), \quad t \to -\infty,$

where $c_2 > 0$ and $L$ is a positive function varying slowly at infinity. Suppose
that $k_{\alpha,n}$ is increasing defined by

(4.11)  $\log\omega_Q(k_{\alpha,n}^{-1}) = -(1 - \log\alpha)\,n^{-1}$

for sufficiently large $n$. If $\alpha_1, \alpha_2 \in (0,1)$ then

(4.12)  $s_{\alpha_1,n}/s_{\alpha_2,n} \to (a - u_{\alpha_1})(a - u_{\alpha_2})^{-1}$

if we observe (4.4) and (4.7).


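THEOREM 4.1. If $p < 1$ then

(4.13)  $\delta_n s_{\alpha,n} \to a - u_\alpha,$

(4.14)  $n\log\omega_Q(s_{\alpha,n}^{-1}) \to -c_1 (a - u_\alpha)^{-p},$

(4.15)  $s_{\alpha,n}/k_{\alpha,n} \to \left(\frac{1 - \log\alpha}{c_1}\right)^{1/p} (a - u_\alpha).$

If $\alpha_n \to \alpha \in [0,1]$ then (for $u_0 = a$, $u_1 = -\infty$)

(4.16)  $\lim_{n\to\infty} \left[E_{Q_\vartheta^n}(1 - \varphi_{\alpha_n,n})\right]^{\delta_n} = \exp(\vartheta(u_\alpha - a)).$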


PROOF. Let $q_{\alpha,n}$ be defined by (1.2) for $\{Q^n\}$ against $\{Q_\vartheta^n\}$. Then

(4.17)  $q_{\alpha,n} = \vartheta s_{\alpha,n} + n\log\omega_Q(\vartheta).$

According to (1.4) the result follows from (3.8) and (4.4) since $q_{\alpha,n}/s_{\alpha,n} \to \vartheta$ for
fixed $\alpha$. In general the result follows from the monotonicity of the error function
for $\alpha$ and the continuity of the right-hand side of (4.16).

The proof of (4.14) and (4.15): Put $b = ((1 - \log\alpha)/c_1)^{1/p}(a - u_\alpha)$. Then by
(4.9)

(4.18)  $\frac{\log\omega_Q(b\,s_{\alpha,n}^{-1})}{\log\omega_Q(k_{\alpha,n}^{-1})} \to 1$

since $n\log\omega_Q(\delta_n) \to \log\omega_{P * \varepsilon_{-a}}(1) = -c_1$ because $\mathcal{L}(T_n \mid Q)^{*n} \to P * \varepsilon_{-a}$ weakly;
compare with Janssen (1985b), (1986).

It is well known that (4.18) implies $k_{\alpha,n}\,b\,s_{\alpha,n}^{-1} \to 1$ since $\log\omega_Q$ is regularly
varying; compare with Janssen (1985a), Lemma 7c. □


REMARK 1. It is well known that $\delta_n$ is of the order

(4.19)  $\delta_n = n^{-1/p} L(n),$

where $L$ denotes a positive function varying slowly at infinity. If the index of
stability $p$ is less than 1 then (4.13), (4.16), and (4.17) show that (4.19) contains
the correct rate of convergence of the interesting quantities for $\alpha \in (0,1)$ which is
faster than the classical speed of convergence $1/n$.

REMARK 2. A straightforward calculation proves that, in view of (4.16), $\varphi_{\alpha,n}$
can be substituted by the test sequence $\psi_{n,\alpha}$ for fixed level $\alpha$ which was proposed
in Janssen (1986), Theorem 13. Note that $\psi_{n,\alpha}$ is an asymptotically most
powerful test sequence for $\{Q^n\}$ against $\{Q^n_{\delta_n\vartheta}\}$.


REMARK 3. For $p = 1$ we only have a partial result. If $(\omega_Q(\delta_n y))^n \exp(y a_n)$
tends to $\omega_P(y) = \exp(cy\log y)$ for $y \in [0,1]$ and some $c > 0$, then $Q$ belongs to
the domain of the one-sided stable distribution $P$ with index 1. In general $a_n$
does not converge and Theorem 4.1 no longer holds. For example, if $P = Q$ for
$c = 1$ then $Q$ is not strictly stable and

$$s_{\alpha,n} = -n u_\alpha - n\log\frac{1}{n}, \qquad a_n = -\log\frac{1}{n}, \qquad \delta_n = \frac{1}{n}, \qquad n \ge 2,$$

and (4.17) yields the connection between $s_{\alpha,n}$ and $q_{\alpha,n}$, where $u_\alpha$ is the $(1 - \alpha)$th
quantile of $P$. In this special case we obtain that for fixed $\vartheta > 0$ the exponent
$1/q_{\alpha,n}$ of (1.3) and (3.8) is equal to

$$\frac{1}{q_{\alpha,n}} = \frac{1}{n} \cdot \frac{1}{\log\omega_Q(\vartheta) - \vartheta u_\alpha - \vartheta\log\frac{1}{n}},$$

which is a faster rate of convergence than $\delta_n = 1/n$.
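
The example in Remark 3 can be checked by hand. The worked computation below is a sketch under two assumptions about the garbled source: that $P = Q$ with $\omega_Q(y) = \exp(y\log y)$, and that the test rejects when $\sum x_i > -s_{\alpha,n}$, so that the quantile relation reads $\delta_n(-s_{\alpha,n}) + a_n = u_\alpha$ (exactly, in this example) and the critical value satisfies $q_{\alpha,n} = \vartheta s_{\alpha,n} + n\log\omega_Q(\vartheta)$.

```latex
\begin{align*}
\bigl(\omega_Q(y/n)\bigr)^{n}\exp(y\log n)
  &= \exp\bigl(y\log(y/n) + y\log n\bigr) = \exp(y\log y) = \omega_P(y), \\
\text{so}\quad \delta_n &= 1/n, \qquad a_n = \log n = -\log(1/n), \\
s_{\alpha,n} &= n\,(a_n - u_\alpha) = -n u_\alpha - n\log(1/n), \\
q_{\alpha,n} &= \vartheta s_{\alpha,n} + n\log\omega_Q(\vartheta)
  = n\bigl(\log\omega_Q(\vartheta) - \vartheta u_\alpha - \vartheta\log(1/n)\bigr), \\
\frac{1}{q_{\alpha,n}} &\sim \frac{1}{\vartheta\, n\log n} \qquad (n \to \infty).
\end{align*}
```

Under these assumptions the exponent $1/q_{\alpha,n}$ is of order $1/(n\log n)$, which is indeed smaller than $\delta_n = 1/n$ by a logarithmic factor.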


REMARK 4. In the case of Theorem 3.1 statement (3.9) proves that the order
of the rate of convergence of

(4.20)  $\log E_{P_1^n}(1 - \varphi_{\alpha,n})$

is the same for all $\alpha \in (0,1)$. But the exact rate of convergence of (4.20) may
depend on $\alpha$; cf. (4.16). On the other hand, (4.16) shows that a different order of
the rate of convergence occurs if $\alpha_n \to 1$. This phenomenon is also new compared
with the results of Krafft and Plachky (1970) who showed that (1.1) still holds if
$(1 - \alpha_n)^{1/n} \to 1$.


REFERENCES 


CHERNOFF, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on a sum of
observations. Ann. Math. Statist. 23 493-507.

CHERNOFF, H. (1956). Large sample theory: Parametric case. Ann. Math. Statist. 27 1-22.

FELLER, W. (1971). An Introduction to Probability Theory and Its Applications 2. Wiley, New York.

JANSSEN, A. (1985a). The convergence of almost regular statistical experiments to Gaussian shifts. In
Proceedings of the 4th Pannonian Symposium on Mathematical Statistics, Bad
Tatzmannsdorf, Austria 1983 (F. Kopecny et al., eds.) 181-204.

JANSSEN, A. (1985b). One-sided stable distributions and the local approximation of exponential
families. Selected Papers, 16th EMS Meeting, Marburg, 1984. Statist. Decisions Suppl. 2.

JANSSEN, A. (1985c). The Lévy–Khintchine formula for infinitely divisible statistical experiments. In
Infinitely Divisible Statistical Experiments. Lecture Notes in Statistics (A. Janssen, H.
Milbrodt, and H. Strasser, eds.) 27 Springer, Berlin.

JANSSEN, A. (1986). Scale invariant exponential families and one-sided test problems. To appear in
Statist. Decisions.

KRAFFT, O. and PLACHKY, D. (1970). Bounds for the power of likelihood ratio tests and their
asymptotic properties. Ann. Math. Statist. 41 1646-1664.

RAO, C. R. (1962). Efficient estimates and optimum inference procedures in large samples. J. Roy.
Statist. Soc. Ser. B 24 46-72.

SENETA, E. (1976). Regularly Varying Functions. Lecture Notes in Math. 508. Springer, Berlin.

STRASSER, H. (1985). Mathematical Theory of Statistics. De Gruyter, Berlin.

TUCKER, H. G. (1975). The supports of infinitely divisible distribution functions. Proc. Amer. Math.
Soc. 49 436-440.


UNIVERSITÄT SIEGEN GHS
FACHBEREICH 6 MATHEMATIK
HÖLDERLINSTRASSE 3

5900 SIEGEN 21 

WEST GERMANY 


The Annals of Statistics 
1986, Vol. 14, No. 3, 1080-1100


STOCHASTIC COMPLEXITY AND MODELING 


By JORMA RISSANEN 


IBM Almaden Research Center 


As a modification of the notion of algorithmic complexity, the stochastic
complexity of a string of data, relative to a class of probabilistic models, is
defined to be the fewest number of binary digits with which the data can be
encoded by taking advantage of the selected models. The computation of the
stochastic complexity produces a model, which may be taken to incorporate
all the statistical information in the data that can be extracted with the
chosen model class. This model, for example, allows for optimal prediction,
and its parameters are optimized both in their values and their number. A
fundamental theorem is proved which gives a lower bound for the code length
and, therefore, for prediction errors as well. Finally, the notions of “prior
information” and the “useful information” in the data are defined in a new
way, and a related construct gives a universal test statistic for hypothesis
testing.


1. Introduction. The purpose of statistical model fitting is to “understand” 
the observed data. If by understanding we mean the ability to remove redundan- 
cies in the data and hence to discover regular statistical features, then the 
ultimate measure of the success of such attempts must be the length with which 
the data can be described, say, in terms of binary digits. Indeed, if such a shortest 
description of the data, to be called stochastic complexity, is found in terms of 
the models of a selected class, there is nothing further anyone can teach us about 
the data; we know all there is to know. This is the rationale behind the MDL 
(minimum description length) criterion, which we, inspired by the algorithmic 
notion of information, Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975), 
introduced in Rissanen (1978) and (1983) in a particular nonpredictive form. The 
criterion also reduces to the maximum likelihood criterion in the special cases 
where the use of the latter is appropriate. 

We may regard our work as a continuation of the program that Fisher began 
with his information, which is defined in terms of the covariance of the estimated 
parameters about a “true” parameter. Wishing to remove the untenable assump- 
tion of data generating systems and “true” parameters, we instead regard the 
class of models to provide a language in which to express the regular features in 
the data. Because then the models cannot be restricted to have a fixed number of 
parameters Fisher’s information does not apply, and we must consider the 
complexity and information directly in the observed data. That the resulting 
complexity still is a meaningful concept is evidenced by the fact that even 
prediction errors can be so expressed. The fact that no “true” parameters are 
needed in our notion implies that the associated optimal model is not an 


Received July 1984; revised October 1985. 

AMS 1980 subject classifications. 62A99, 62M10, 62F03, 60F99. 

Key words and phrases. Inference, number of parameters, model selection criteria, prediction, 
coding. 




approximation of anything at all. Rather, it acquires its own data dependent 
meaning from the two interpretations of the complexity, the first as the shortest 
code length and the second as the smallest prediction errors. As a further 
consequence, any two models, even with different numbers of parameters, can be 
fairly compared, which, for example, in the important selection-of-variables 
problem pretty much settles a major question about the least-squares estimates 
left open by Gauss, namely, how to estimate the number of the regression 
parameters. 

The first contribution in this paper is to define the notion of stochastic 
complexity, especially when the coding is done in a predictive manner. This will 
provide a criterion, which, unlike the earlier nonpredictive MDL criterion, 
penalizes the number of the parameters in the fitted models without any 
explicitly added term. The associated predictive MDL modeling principle turns 
out to be closely related to the “prequential” principle, discovered independently 
by Dawid (1984) for the case where the data have a natural order. Still another 
related idea, called “forward validation”, appears in Hjorth (1982), where it, 
however, was used in a traditional manner to provide unbiased estimates or 
estimates with reduced bias. 

The main contribution in this paper is a fundamental theorem, which sets a 
tight lower bound for the code length with which long strings of data can be 
encoded with help of a class of models. Because prediction is just another form of 
coding, the theorem also gives a universal lower bound for the mean prediction 
errors of any predictors. The theorem may further be used to assess the goodness 
of estimators, which, unlike in the Cramér-Rao bound, may include the estimates 
of the number of the parameters as well. This theorem with a companion 
theorem, stating that the complexity achieves the lower bound, may be taken to 
provide a rational basis for model comparison, regardless of whether the models 
have the same number of parameters or not. 

As the third contribution of this paper we formalize concepts such as “useful 
information” and “prior information” in the data. We also define a universal test 
statistic for hypothesis testing, which appears to have a number of advantages. 
We illustrate the idea by a test of two-way contingency tables. 


2. Predictive MDL principle. The probabilistic models we consider consist 
of indexed densities $f_\alpha(x|u)$, or ultimately probabilities $P_\alpha(x|u)$, where 
$x = x_1, \ldots, x_n$, also written as $x^n$, denotes a sample of length n as a response or 
“output” to another “input” sample $u = u^n$ of the same length. Because the 
input sample adds nothing new in principle, we drop it to simplify the notations; 
we illustrate its use in Example 2 below. We denote both random variables and 
their values with lower case letters, letting the context tell which is meant. The 
data items are often numerical, but, of course, not always. When numerical, each 
number in the binary notation, say, has only some number r of fractional digits. 
Hence, when the model is a density it assigns a probability to x, which is 
obtained by integrating the density over the n-dimensional cube of edge length 
$2^{-r}$ with x as the center. We denote this induced probability function by $P_\alpha(x)$ 
without indicating the implicitly understood precision r, which we otherwise do 
not need. The index $\alpha$ may be taken sufficiently general to allow comparison of 
nested and nonnested models alike. However, it is the number of parameters that 
turns out to be the interesting quantity, and we for simplicity take the index to 
be of the form $\alpha = (k, \theta)$, where k denotes the number of components in the 
parameter vector $\theta = (\theta_1, \ldots, \theta_k)$, and $k = 0, 1, \ldots$. The value k = 0 corresponds 
to the empty parameter $\lambda$. 

We are interested in predicting the sequence x as well as coding it. The former 
may be viewed as a special predictive form of coding, and we gain generality by 
proceeding with the coding interpretation. Often, we wish to model the data such 
that the individual observations are independent. Then, instead of coding a 
sequence the relevant problem is to encode the n-element unordered “list” $\{x_i\}$, 
where repeated occurrences of a value are preserved. The required modification 
for such a case will be discussed below. Predictive coding means that we model 
the conditional density for the possible values of the “next” observation $x_{t+1}$, 
thus 


(2.1)    $f_{k,\hat\theta(t)}(x_{t+1}|x^t)$, 


where $\hat\theta(t) = \hat\theta(x^t)$ is obtained with an estimation algorithm for the parameter $\theta$ 
with k components. Such a density allows us to encode the observation $x_{t+1}$ to 
the precision r with the “ideal” code length $-\log P_{k,\hat\theta(t)}(x_{t+1}|x^t)$, which, as just 
explained, is represented by $-\log f_{k,\hat\theta(t)}(x_{t+1}|x^t)$. The word “ideal” means that if 
the possible values of the next observation indeed are distributed as modeled, 
then no prefix code exists with a shorter mean length. Whenever we wish to 
express the code length as the number of binary digits in the coded string, the 
logarithm is to be taken to the base 2; otherwise, its base does not matter. By 
adding all these ideal code lengths, we get the total code length 


(2.2)    $L(x|k) = -\sum_{t=0}^{n-1} \log f_{k,\hat\theta(t)}(x_{t+1}|x^t)$. 


This may be minimized with respect to k to give the estimator $\hat k(n) = \hat k(x^n)$, 
which with the last data point defines the final estimate $\hat\theta(n)$ having $\hat k(n)$ 
components. 

How should we select the estimate $\hat\theta(t)$ for each k? On first thought one might 
think of picking it so as to minimize the ideal code length $-\log f_{k,\theta}(x_{t+1}|x^t)$. 
But, clearly, this cannot be done, because such a minimization would make $\hat\theta(t)$ a 
function of $x_{t+1}$, which, in turn, would make decoding impossible. Indeed, 
decoding of $x_{t+1}$ requires the knowledge of $\hat\theta(t)$, which therefore must not 
depend on the value $x_{t+1}$ to be decoded. We are faced with the central issue in 
inductive inference, and we reason as follows: In the light of past observations 
the best single value of the parameter for encoding the “next” observations, 
$x_{i+1}$, $i = 0, 1, \ldots, t-1$, is the value that minimizes the sum 
$-\sum_{i=0}^{t-1} \log f_{k,\theta}(x_{i+1}|x^i)$. This is the maximum likelihood estimate $\hat\theta(t)$, except 
that we add the restriction that the predicted density (2.1) is positive for every 
possible value of $x_{t+1}$, which is required to make (2.2) meaningful for all data 
sequences. We might then say that this choice for the estimator $\hat\theta(t)$ is based 
upon the hope that the predicted distribution (2.1) for the new observation $x_{t+1}$ 
is like it was in the past, which to us seems to be as sound a principle for 
statistical inference as any. After all, by its nature inductive inference is based 
precisely upon such a faith; the same reasoning was also applied in the “pre- 
quential” procedure, Dawid (1984). 

The minimization of (2.2) requires the initial estimate $\hat\theta(0)$ for each number of 
components k. The traditional way to calculate such an estimate is to select more or less 
arbitrarily a prior density function for the parameters and then take one of the 
maximizing values as the estimate $\hat\theta(0)$. The predictive approach, however, offers 
a different way, and one which avoids the conceptually and technically 
difficult problem of specifying the prior densities. Indeed, what (2.2) really 
requires is the specification of a density function $f(x_1)$ for the first observation 
such that it reflects our prior knowledge about its value. Technically, we may 
take this density function to be in the parametric family and specified by the 
empty parameter $\lambda$. Such a distribution is often much easier to pick than a prior 
for the parameters. For example, if the prior knowledge consists of the fact that 
the set of possible values of $x_1$ is finite, with M elements, put $-\log f(x_1) = \log M$. The 
procedure to compute (2.2) for each selected number of parameters k is then as 
follows: The first observation $x_1$ is encoded with the ideal code length $-\log f(x_1)$, 
where the density is selected to represent our knowledge, often ignorance, about 
the value $x_1$. We continue encoding the next observations with this same density 
until one parameter can be uniquely fitted, and we increase the number of fitted 
parameters in this manner one by one until the set value k, needed in the 
evaluation of (2.2), is reached. 

The minimized code length (2.2) does not quite represent the complexity of the 
sequence x, because it is conditioned on the optimizing number of parameters, 
which clearly is required in the decoding process. This value can be given in a 
coded form as a preamble in the entire code string. Because the decoder will have 
to be able to separate the binary codeword representing k(n) from the subse- 
quent code of the data without a separating comma, the preamble must be a 
so-called prefix code. As discussed in Rissanen (1983), encoding the natural 
number k by a prefix code requires 


(2.3)    $L^*(k) = \log^* k + \log c$ 


binary digits, where $\log^* k = \log k + \log\log k + \cdots$, the sum including all the 
positive iterates, and c is the constant, about 2.865, that makes $\sum_{k=1}^{\infty} 2^{-L^*(k)} = 1$. 
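As a minimal sketch of the prefix code length (2.3), assuming logarithms to the base 2 and the rounded value c = 2.865 quoted above, the following computes log* k and the ideal length L*(k). The function names are illustrative and are not taken from Rissanen (1983). The partial Kraft sums stay below 1 and approach it only very slowly, because the terms decay barely faster than 1/(k log k).

```python
import math

C = 2.865  # rounded normalizing constant quoted in the text (an approximation)


def log_star(k: int) -> float:
    """Sum of the positive iterated base-2 logarithms of a natural number k >= 1."""
    if k < 1:
        raise ValueError("k must be a natural number >= 1")
    total = 0.0
    value = math.log2(k)
    while value > 0.0:          # keep only the positive iterates
        total += value
        value = math.log2(value)
    return total


def prefix_code_length(k: int) -> float:
    """Ideal code length L*(k) = log* k + log2 c, in bits."""
    return log_star(k) + math.log2(C)


def kraft_partial_sum(limit: int) -> float:
    """Sum of 2^(-L*(k)) for k = 1, ..., limit."""
    return sum(2.0 ** -prefix_code_length(k) for k in range(1, limit + 1))


if __name__ == "__main__":
    for k in (1, 2, 16, 1000):
        print(k, round(prefix_code_length(k), 3))
    print("partial Kraft sum up to 1000:", round(kraft_partial_sum(1000), 4))
```

The cost of naming k grows only slightly faster than log k, which is why this term plays a minor role in the criterion that follows.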


Therefore, we may define the (semi) predictive (stochastic) complexity of the 
sequence x, relative to the selected class of models, as 


(2.4)    $I_{SP}(x) = \min_k \{L(x|k) + \log^* k + \log c\}$. 


The word “semi” suggests that the optimizing number of parameters, which we 
still write as k(n), is not determined the predictive way. To avoid misunder- 
standings we emphasize that the main effect for penalizing the number of 
parameters in (2.4) is by no means due to the second term, log*k. In fact, in most 
if not all the cases of interest the minimizations of (2.2), where no such term 
appears, and (2.4) produce exactly the same number of parameters, which is why 
we may safely use the same symbol to denote both. 
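The semi-predictive complexity can be sketched on a toy model class that is not treated in the text: binary Markov chains of order 0, 1, 2, ..., where an order-m chain is taken to have k = 2^m parameters. Each context keeps its own count-based estimate (m + 1)/(t + 2), and symbols that lack a full context are coded with log M = 1 bit, in the spirit of the initialization described above. The model class, the function names, and the rounded constant c = 2.865 are all assumptions made for illustration.

```python
import math
from collections import defaultdict

C = 2.865  # rounded constant from the prefix code for the integers


def log_star(k):
    """Sum of the positive iterated base-2 logarithms of k >= 1."""
    total, value = 0.0, math.log2(k)
    while value > 0.0:
        total += value
        value = math.log2(value)
    return total


def predictive_code_length(bits, order):
    """Accumulated ideal code length in bits for a binary Markov model.

    Every estimate uses past symbols only, so a decoder can repeat it.
    """
    counts = defaultdict(lambda: [0, 0])   # context -> [zeros seen, ones seen]
    total = 0.0
    for t, symbol in enumerate(bits):
        if t < order:
            total += 1.0                   # no full context yet: 1 bit per symbol
            continue
        context = tuple(bits[t - order:t])
        seen = counts[context]
        prob = (seen[symbol] + 1) / (seen[0] + seen[1] + 2)
        total += -math.log2(prob)
        seen[symbol] += 1
    return total


def semi_predictive_complexity(bits, max_order=3):
    """Return (best order, total bits) minimizing L(x|k) + log* k + log2 c."""
    best = None
    for order in range(max_order + 1):
        k = 2 ** order                     # number of free parameters
        length = predictive_code_length(bits, order) + log_star(k) + math.log2(C)
        if best is None or length < best[1]:
            best = (order, length)
    return best


if __name__ == "__main__":
    data = [0, 1] * 20
    for order in range(4):
        print(order, round(predictive_code_length(data, order), 2))
    print(semi_predictive_complexity(data))
```

On the alternating string the accumulated code length alone already singles out order 1, and adding the cost of naming k does not change the choice, in line with the remark above about the minor role of the log* term.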




We can apply the above discussed inductive reasoning to obtain a purely 
predictive complexity. Indeed, let $\hat k(t)$ denote the minimizing number of parameters 
in (2.2), where n is replaced by t. Then we may regard the pair $(\hat k(t), \hat\theta(t))$ to 
represent our best estimate of the conditional density for the possible values of 
the “next” observation $x_{t+1}$ available at time t. Adding the resulting ideal code 
lengths we get the purely predictive (stochastic) complexity as follows 


(2.5)    $I(x) = -\sum_{t=0}^{n-1} \log f_{\hat k(t),\hat\theta(t)}(x_{t+1}|x^t)$, 


where $\hat k(0) = 0$ and $\hat\theta(0) = \lambda$, representing the empty set of parameters. In other 
words, the initial density $f(x_1)$ is determined as described above. 
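A hedged sketch of the purely predictive rule, for a deliberately small model class that is assumed here for illustration: k = 0 is the fixed density assigning probability 1/2 to each binary symbol, and k = 1 is the Bernoulli model with a count-based estimate (m + 1)/(t + 2). At each time t the next symbol is coded with whichever model has the shorter accumulated code length on the past data. Breaking ties toward the smaller k is an assumption of this sketch.

```python
import math


def purely_predictive_complexity(bits):
    """Code length in bits when the number of parameters is re-chosen at every step."""
    lengths = [0.0, 0.0]   # accumulated code lengths of the past data for k = 0, 1
    ones = 0               # occurrences of symbol 1 so far
    total = 0.0
    for t, symbol in enumerate(bits):
        # Predicted probability of symbol 1 under each model, from the past only.
        p_one = [0.5, (ones + 1) / (t + 2)]
        costs = [-math.log2(p if symbol == 1 else 1.0 - p) for p in p_one]
        # k_hat(t): the model that coded the past best (ties go to the smaller k).
        k_hat = 0 if lengths[0] <= lengths[1] else 1
        total += costs[k_hat]
        lengths[0] += costs[0]
        lengths[1] += costs[1]
        ones += symbol
    return total


if __name__ == "__main__":
    print(purely_predictive_complexity([1] * 20))
    print(purely_predictive_complexity([0, 1, 1, 0, 1, 0, 0, 1] * 4))
```

No preamble naming k is needed here, because the decoder can recompute the choice of model from the symbols it has already decoded.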

In the case where the observed data do not form a natural sequence, and they 
are modeled as independent, we should modify (2.2) by minimizing it over all 
permutations. In other words, we should find that order which allows for the 
shortest code for the unordered list. Except for very small sample sizes such a 
search is far too complex, and we construct a symmetric function of the data by a 
local optimization procedure as follows: 


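(2.6)    $L(x|k) = \sum_{t=0}^{n-1} \min_j \{-\log f_{k,\hat\theta(t)}(x_j | x_{i(1)}, \ldots, x_{i(t)})\}$, 


where the minimum is taken over the indices j of the observations not yet encoded, 
and the minimizing index j defines i(t + 1). The associated predictive and 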
semi-predictive complexities for the unordered list of observed data are then 
defined analogously with (2.5) and (2.4). We illustrate this procedure by Example 
2 below. 


Discussion. In Rissanen (1978) and (1983) we, in effect, defined a third, 
purely nonpredictive notion of complexity, which to within terms of order log n is 
given by 


(2.7)    $I_N(x) = \min_{k,\theta} \{-\log f_{k,\theta}(x) + \frac{k}{2}\log n\}$. 


This formula results from a particular way of coding the data, where the second 
term represents the number of digits required to encode k parameters to an 
optimal precision. Clearly, this nonpredictive stochastic complexity cannot be 
meaningfully interpreted in terms of prediction errors. The criterion (2.7) is seen 
to be identical in form but not in scope nor in content with Schwarz’s criterion, 
Schwarz (1978). 

We now have three different versions of complexity for a sequence. This 
abundance is a reflection of the difficulty in defining the notion of complexity in 
an objective way for short data sequences. The trouble arises just as soon as we 
try to make precise the way one is allowed to use the models in the selected class 
to do the coding. For example, one cannot permit the estimators $\hat\theta(t)$ to be 
completely arbitrary, because there may exist the estimator that assigns the 
probability unity to the actually occurring value $x_{t+1}$ for each t, which would 
give a perfect prediction. To be sure, such an estimator would require the 




knowledge of this value, but for a given sequence such a “predictor” exists as a 
mathematical function. One could bar out such estimators by requiring their 
description to have a uniformly bounded length, independent of the sample size 
n, but then we would be back in the algorithmic notion of complexity, and we 
would not be able to use our model classes in any meaningful way as the 
“language” in which to look for the regular features in the data. 

For these reasons, in the three notions of complexity the estimators $\hat\theta(t)$ to be 
used are specified one way or another. Although the chosen estimators, of course, 
are meaningful and natural, their selection was done on subjective grounds just 
the same, which is what we ideally would have liked to avoid. (Such qualms 
might be considered as being of no practical significance, but our aim in this 
paper is to seek a foundation for statistical reasoning which is as free from 
arbitrary choices as we can make it.) The situation improves rapidly with the 
growing sample size n, because then all the three notions of complexities tend to 
be equal, and, moreover, all of them, indeed, represent asymptotically the 
shortest code length, calculated per observation, available with any ways of doing 
the coding. This, of course, is the reason why it at all is meaningful to talk about 
complexities. We prove such results in the next section. 


EXAMPLE 1. Consider the class of Bernoulli models. Hence, k = 1 and $\theta = p$, 
the probability of the occurrence of symbol 1. In lack of prior knowledge the first 
symbol is taken to occur with probability 1/2. Hence no prior density in the 
parameter space nor the Bayesian formalism is needed. Having observed m 
occurrences of the symbol 1 in the past t symbols, we form the estimate 
$\hat p(t) = (m + 1)/(t + 2)$, which is seen to modify the maximum likelihood estimate 
such that the values 0 and 1 are avoided. By an easy induction the entire 
string of length n with $n_i$ occurrences of symbol i, i = 0, 1, gets the probability 
$P(x) = n_0!\,n_1!/(n + 1)!$. Notice that the sum of this over all strings of length n 
is unity. The predictive and the semi-predictive complexities, taken either for 
strings or for unordered lists, agree, and they all are given by $-\log P(x)$, which 
by Stirling’s formula also can be written as 


(2.8)    $I(x) = nH(n_1/n) + \frac{1}{2}\log n + O(1)$, 


where $H(p) = -p\log p - (1 - p)\log(1 - p)$. 
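The claims of Example 1 can be checked numerically. The sketch below, with logarithms to the base 2 and function names chosen here for illustration, codes a binary string step by step with the estimate (m + 1)/(t + 2), compares the result with the closed form for -log P(x), and measures how far the code length lies from nH(n_1/n) + (1/2) log n.

```python
import math
from itertools import product


def predictive_code_length(bits):
    """Sequential code length in bits, using the estimate (m + 1) / (t + 2)."""
    ones, total = 0, 0.0
    for t, symbol in enumerate(bits):
        p_one = (ones + 1) / (t + 2)
        total += -math.log2(p_one if symbol == 1 else 1.0 - p_one)
        ones += symbol
    return total


def closed_form_code_length(n0, n1):
    """-log2 of n0! n1! / (n + 1)!, evaluated through log-gamma."""
    log_p = math.lgamma(n0 + 1) + math.lgamma(n1 + 1) - math.lgamma(n0 + n1 + 2)
    return -log_p / math.log(2.0)


def entropy_bits(p):
    """Binary entropy H(p) in bits, for 0 < p < 1."""
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def stirling_gap(n0, n1):
    """Code length minus n H(n1/n) minus (1/2) log2 n."""
    n = n0 + n1
    return closed_form_code_length(n0, n1) - n * entropy_bits(n1 / n) - 0.5 * math.log2(n)


if __name__ == "__main__":
    x = [1, 0, 1, 1, 0]
    print(predictive_code_length(x), closed_form_code_length(2, 3))
    # The induced probabilities of all strings of a fixed length add up to one.
    print(sum(2.0 ** -predictive_code_length(s) for s in product((0, 1), repeat=6)))
    for scale in (1, 10, 100):
        print(1000 * scale, stirling_gap(300 * scale, 700 * scale))
```

For a fixed ratio n_1/n strictly between 0 and 1 the printed gap settles near a constant, so the remainder in the Stirling expansion is bounded in n.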


We next study to what extent prediction error measures can be interpreted as 
code lengths, which at the same time illustrates how the large classes of models as 
studied here are typically generated. Let $\hat x_{t+1} = g(x^t)$ denote a parametric 
predictor of $x_{t+1}$, where the parameter is to be determined from the past data. 
Usually, the predictors are defined by recurrence equations such as of the ARMA 
type. One may view this process as a means of accounting for the dependencies in 
the data, which when done well causes the prediction errors to be nearly 
independent. Next, let $\delta(x_{t+1}, \hat x_{t+1})$ denote a measure of the prediction error. 
Now define 


(2.9)    $f_\theta(x_{t+1}|x^t) = K(x^t, \theta)\, 2^{-\delta(x_{t+1}, \hat x_{t+1})}$, 


where $K(x^t, \theta)$ denotes that number for which $\int f_\theta(y|x^t)\,dy = 1$. We then see 
that 


$-\log f_\theta(x_{t+1}|x^t) = \delta(x_{t+1}, \hat x_{t+1}) - \log K(x^t, \theta)$ 


represents an ideal code length for the observation $x_{t+1}$ given the past data. With 
a suitable estimator $\hat\theta(t) = \hat\theta(x^t)$ the total ideal code length takes the form 


(2.10)    $L(x|k) = \sum_{t=0}^{n-1} \delta(x_{t+1}, \hat x_{t+1}) - \sum_{t=0}^{n-1} \log K(x^t, \hat\theta(t))$. 


We see that this predictive MDL criterion differs from the first sum, involving 
the prediction errors, only to the extent the second term depends on k. Most of 
the usual prediction error measures actually depend only on the difference 
$e_{t+1} = x_{t+1} - \hat x_{t+1}$, and, moreover, often the possible values of $x_{t+1}$ range from 
$-\infty$ to $+\infty$. Then we see that $K(x^t, \hat\theta(t)) = K(x^t)$, and the difference between 
the two criteria amounts to a constant. 
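To make the normalization in (2.9) concrete, here is a small numerical sketch for the quadratic error measure with code lengths in bits. The midpoint rule, the integration range, and the function names are choices made for this illustration only. For the squared error the normalizer has the closed form sqrt(ln 2 / pi), and it does not depend on the predicted value, so the code length differs from the squared error by the same constant at every step.

```python
import math


def squared_error(y, prediction):
    """Quadratic prediction error measure."""
    return (y - prediction) ** 2


def normalizer(delta, prediction, lo=-20.0, hi=20.0, steps=20000):
    """Numerically find K such that K * integral of 2^(-delta(y, prediction)) dy = 1.

    Uses the midpoint rule on [lo, hi]; the range is assumed wide enough
    for the tails to be negligible.
    """
    width = (hi - lo) / steps
    mass = sum(2.0 ** -delta(lo + (i + 0.5) * width, prediction) for i in range(steps))
    return 1.0 / (mass * width)


def code_length(x_next, prediction, delta):
    """Ideal code length in bits: delta(x, x_hat) - log2 K."""
    return delta(x_next, prediction) - math.log2(normalizer(delta, prediction))


if __name__ == "__main__":
    print(normalizer(squared_error, 0.0), math.sqrt(math.log(2.0) / math.pi))
    print(normalizer(squared_error, 3.7))
    print(code_length(1.2, 1.0, squared_error))
```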

With the quadratic error function, giving rise to gaussian models, the predic- 
tive MDL principle reduces to a predictive least-squares principle, which we 
illustrate in the important selection-of-variables regression problem; for ARMA 
estimation, see Rissanen (1986a). 


EXAMPLE 2. The observations consist of n tuples, $(x(i), u_1(i), \ldots, u_m(i))$, 
$i = 1, \ldots, n$, m + 1 elements in each, where m, the number of the regressor 
variables or “inputs” in our terminology, may be large. The basic problem is to 
find out which subcollection of the regressor variables gives the best prediction of 
the variable x. In order to simplify the description, we only look for subcollections 
consisting of the first k variables, and ask for an optimum value for k. The 
general case is similar except numerically more complex. We consider a linear 
predictor of the usual type 


(2.11)    $\hat x(i) = \sum_{j=0}^{k} a_j u_j(i)$, 


where $u_0(i) = 1$. We measure the prediction errors by the sum of the squares, 
$\delta(x_{t+1}, \hat x_{t+1}) = e_{t+1}^2$, where $e_{t+1} = x_{t+1} - \hat x_{t+1}$. Then $f_\theta(x_{t+1}|x^t)$, defined by 
(2.9), is seen to be normal with mean $\hat x_{t+1}$ and variance 1. 

Ignoring initial knowledge we look for the smallest observation $x(i_1)$ (in absolute 
value) according to (2.6), which we predict as 0. For k = 0, which means that we ignore all the 
regressor functions, the best fit for $a_0$ from the past observations at times 
$i_1, \ldots, i_t$ is the average $\hat a_0(t) = (1/t)\sum_{j=1}^{t} x(i_j)$, which is taken as the prediction of 
that observation $x(i_{t+1})$, among those not yet predicted, for which the prediction 
error is smallest. Adding such prediction errors $(x(i_{t+1}) - \hat a_0(t))^2$ over all the data 
gives L(0). For k = 1 we still predict $x(i_1)$ as 0. Recursively, suppose we have 
calculated $i_1, \ldots, i_t$ (which need not coincide with the indices found for k = 0), at 
which points the corresponding prediction errors $e^2(i_j)$ also have been determined. 
We find $\hat a_{0,1}(t)$ and $\hat a_{1,1}(t)$ by minimizing $\sum_{j=1}^{t} (x(i_j) - a - b u_1(i_j))^2$ 
with respect to a and b. This gives the prediction error $e(i_{t+1}) = x(i_{t+1}) - \hat x(i_{t+1})$, 
where $\hat x(j) = \hat a_{0,1}(t) + \hat a_{1,1}(t) u_1(j)$, and $i_{t+1}$ denotes the index of the 
observation not yet predicted for which the prediction error is smallest. Adding 
again the squared prediction errors over all the data we get the sum L(1). We 
continue this way calculating L(k) for each k, and we find the minimizing 
number $\hat k = \hat k(n)$, which, in turn, defines the least squares estimates for the real 
valued parameters from all the data. 
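The greedy procedure of Example 2 can be sketched for the two smallest cases, k = 0 (a constant only) and k = 1 (a constant plus one regressor). This is an illustration under stated assumptions, not the program referred to in the text: the function names are hypothetical, only one regressor is handled, and while the slope is not yet uniquely determined the sketch falls back to the average, following the rule of adding fitted parameters one by one.

```python
def fit_predict(past, u, k):
    """Least-squares prediction at input u from the already predicted (u, x) pairs."""
    if not past:
        return 0.0                          # the first item is predicted as 0
    mean_x = sum(x for _, x in past) / len(past)
    if k == 0:
        return mean_x                       # best constant fit: the average
    mean_u = sum(v for v, _ in past) / len(past)
    s_uu = sum((v - mean_u) ** 2 for v, _ in past)
    if s_uu == 0.0:
        return mean_x                       # slope not yet uniquely determined
    s_ux = sum((v - mean_u) * (x - mean_x) for v, x in past)
    return mean_x + (s_ux / s_uu) * (u - mean_u)


def greedy_squared_error(points, k):
    """Accumulated squared prediction error L(k) for an unordered list of (u, x)."""
    remaining = list(points)
    past = []
    total = 0.0
    while remaining:
        errors = [(x - fit_predict(past, u, k)) ** 2 for u, x in remaining]
        j = min(range(len(remaining)), key=errors.__getitem__)
        total += errors[j]                  # the easiest remaining item comes next
        past.append(remaining.pop(j))
    return total


def best_k(points, max_k=1):
    """Number of regressors (0 or 1 in this sketch) with the smallest L(k)."""
    return min(range(max_k + 1), key=lambda k: greedy_squared_error(points, k))


if __name__ == "__main__":
    data = [(u, 2.0 + 3.0 * u) for u in range(6)]
    print(greedy_squared_error(data, 0), greedy_squared_error(data, 1), best_k(data))
```

Because the next item is always chosen among those not yet predicted, the result does not depend on the order in which the tuples are listed, apart from ties.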

This routine for fitting polynomials to scattered points, marked by hand on 
the screen of a computer, was programmed. The displayed optimum degree 
polynomials agreed in an uncanny way with the best polynomial judged by the 
human eye. Also, it was found to be essential to compute the predictive complexity 
for an unordered list, rather than for the data ordered in a random way, to 
avoid initialization problems. We proved in Rissanen (1984b) that under reasonable 
conditions on the regressor variables the estimates $\hat k(n)$ are consistent, and 
that the estimates of all the parameters are asymptotically optimal in minimizing 
the mean per observation prediction errors $E(1/n)S(\hat k(n))$ in the case where 
such a proof makes sense, namely, where a “true” set of parameters exists. These 
results, in effect, settle the issue of how one should estimate the number of 
regressor variables, because prediction is the very reason we want them. 


3. Main bound. The main result to be stated requires certain smoothness 
conditions on the parametric densities, which determine the way certain esti- 
mates of the parameters converge. These conditions need a verification in each 
individual class of models. By the theory of large deviations they can be shown 
to be satisfied for the class of Markov chains, as shown by H. Künsch in 
Rissanen (1986b). By a rather different (and difficult) analysis they can also be 
verified for the gaussian ARMA models. For the gaussian regression case, 
Example 2 above, the required conditions are trivially satisfied. 


THEOREM 1. Let for each k the parameters $\theta$ range over a compact subset 
$\Omega^k$ with nonempty interior of the k-dimensional Euclidean space. We assume 
that there exist estimates $\hat\theta(x^n)$ satisfying the central limit theorem such that the 
tail probabilities are uniformly summable as follows 


(3.1)    $P_\theta\{\sqrt{n}\,\|\hat\theta(x^n) - \theta\| \ge \log n\} \le \delta(n)$, for all $\theta$, and $\sum_n \delta(n) < \infty$, 


where $\|\theta\|$ denotes a norm. If g is any density defined on the observations, 
satisfying the compatibility conditions for a random process, then for all k and 
all $\theta \in \Omega^k$, except in a set of Lebesgue measure zero, 


(3.2)    $\liminf_{n \to \infty} \frac{E_{k,\theta} \log[f_{k,\theta}(x^n)/g(x^n)]}{(k/2)\log n} \ge 1$. 


The mean is taken relative to the distribution defined by $f_{k,\theta}$. 





The proof is given in Appendix A. 


Discussion. The claim can also be stated thus: For all k, all positive 
numbers $\varepsilon$, and for all points $\theta \in \Omega^k$, except in a null set, 


(3.3)    $E_{k,\theta} \log \frac{f_{k,\theta}(x^n)}{g(x^n)} \ge (\frac{1}{2} - \varepsilon) k \log n$, 


for all but finitely many values of n. 

This theorem has many uses, the most important of which is that it justifies 
the notions of stochastic complexity thereby providing a rational basis for model 
assessment regardless of the number of parameters in them. To see this we first 
demonstrate how the theorem may be regarded as a generalization of Shannon’s 
famous coding theorem. Let L(x) be any real-valued function, interpreted as a 
code length, which satisfies the inequality 


(3.4)    $L(x) \ge -\log g(x)$, all x, 


where g(x) is a density defining a random process. Because it integrates to unity 
over the sequences of the same length, we see that (3.4) requires the length to 
satisfy a generalized Kraft inequality. But g also satisfies the compatibility 
conditions for a random process, which is reflected in a similar but weaker 
requirement for the code length. These conditions are natural enough, and all the 
usual codes satisfy them. We call such a code length regular for brevity. By 
applying the theorem to the density g(x) the inequality (3.4) converts (3.3) to the 
desired result, 


(3.5)    $E_{k,\theta} L(x^n) \ge H_{k,\theta}(n) + (\frac{1}{2} - \varepsilon) k \log n$, 


where $H_{k,\theta}(n)$ denotes the entropy of the data sequences of length n. Shannon’s 
inequality results from k = 0, which represents the case of a model class having 
only one member, and fixing n. 

It is readily seen that (3.4) holds for the semi-predictive complexity $L(x) = I_{SP}(x)$ 
and 


$g(x) = \sum_{k=1}^{\infty} 2^{-L(x|k) - L^*(k)}$, 


where $L(x|k)$ is given by (2.2) and $L^*(k)$ is the prefix code length of k. The predictive 
complexity, too, is regular, for it 
satisfies (3.4) with equality, as seen by putting $g(x) = 2^{-I(x)}$. The nonpredictive 
complexity (2.7) satisfies (3.4) up to terms of order log n. The inequality (3.5), 
then, gives a justification for the term “minimum description” in Rissanen 
(1983); there we described a particular coding scheme but did not prove that it 
produces an asymptotically shortest code length. 
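The bound (3.5) can be illustrated exactly for the Bernoulli class of Example 1, where k = 1. The sketch below, with function names chosen for this illustration, computes the mean code length of the predictive code minus the entropy nH(p) by summing over the binomial distribution of the number of ones, and divides the result by (1/2) log n. No simulation is involved.

```python
import math


def expected_redundancy_bits(p, n):
    """Exact E[I(x^n)] - n H(p) in bits for the predictive Bernoulli code."""
    ln2 = math.log(2.0)
    entropy = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p)) / ln2
    expected = 0.0
    for m in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
        # Probability that a string of length n contains exactly m ones.
        weight = math.exp(log_binom + m * math.log(p) + (n - m) * math.log(1.0 - p))
        # Code length -log2 of m! (n - m)! / (n + 1)!, shared by all such strings.
        code_len = (math.lgamma(n + 2) - math.lgamma(m + 1) - math.lgamma(n - m + 1)) / ln2
        expected += weight * code_len
    return expected - n * entropy


def redundancy_ratio(p, n):
    """Mean excess code length divided by (k/2) log2 n with k = 1."""
    return expected_redundancy_bits(p, n) / (0.5 * math.log2(n))


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(redundancy_ratio(0.3, n), 4))
```

The ratio lies below 1 for moderate n, because the bounded remainder is negative here, and it increases toward 1 as n grows. This agrees with the bound, which holds for every positive $\varepsilon$ once n is large enough.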

Further, Theorem 1 gives a lower bound for the accumulated prediction errors 
resulting from “honest” predictors. By an “honest” predictor we mean one where 
the prediction of the t’th data item is made as a function of the previous items 
only. Such one-sided predictions guarantee by Bayes’ theorem that the resulting 
minimized joint density is both proper and satisfies the compatibility conditions 
for a random process, as was seen to be the case with the predictive complexity 




above. Hence, Theorem 1 applies. The qualification of “honesty” is necessary 
because the cross-validation technique in Stone (1974) and Geisser and Eddy 
(1979) involves prediction which is “dishonest” in the sense that a data item is 
predicted both from the “future” and the “past” values alike. Evidently, in a 
given sequence, only one intermediate value can be meaningfully so predicted; 
the other data items are needed in this prediction and, therefore, there is no point 
in predicting them. A specific statement of a bound for “honest” predictions for 
the regression problem and the gaussian ARMA processes is given in Rissanen 
(1984b) and (1984a), respectively. We give here the latter statement, where the 
quantifications are weaker than in Theorem 1 for the reason that at that time we 
had not verified the condition (3.1) for gaussian ARMA processes. 


THEOREM 2. Consider the set of gaussian ARMA(p, q) processes, where the 
p + q + 1 parameters $\theta = (a_1, \ldots, a_p, b_0, \ldots, b_q)$ range over a compact subset 
$\Omega^{p+q+1}$ of the (p + q + 1)-dimensional Euclidean space with nonempty interior. 
In particular, $b_0 = \sigma$, where $\sigma^2$ denotes the variance of the innovation process. 
Let $\hat x_t$ be any predictor of $x_t$ as a measurable function of the past data $x^{t-1}$. 
Then for all p and q, all positive numbers $\varepsilon$, and for all points $\theta \in \Omega^{p+q+1}$, 
except in a set A(n), the volume of which shrinks to zero as n grows, 


(3.6)    $\frac{1}{n} E_\theta \sum_{t=1}^{n} (x_t - \hat x_t)^2 \ge \sigma^2 \left(1 + \frac{p + q - \varepsilon}{n} \ln n\right)$. 


Many criteria for estimation appearing in the literature can be expressed as 
the negative logarithm of a product of conditional densities for the data items 
with possibly some added terms to penalize the number of parameters, to be 
minimized over the parameters. In view of Theorem 1 there is a growing 
suspicion in this author’s mind that unless the minimized criterion satisfies the 
inequality (3.4), the estimation will run into one or another kind of trouble. Prime 
examples of this are the maximum likelihood function and the ordinary least- 
squares criterion, neither of which can be applied to estimation of complex 
models where the number of the parameters is also to be estimated. The 
cross-validation criteria were meant to rectify this problem, but they appear to 
be asymptotically equivalent with Akaike’s AIC, which does not satisfy (3.4), and 
they do not allow a consistent estimation of the number of the parameters, Stone 
(1977a, b). Hence, it is not the idea of “cross-validation” that does the trick but a 
normalization (3.4). For the same reason this author is skeptical of the Bayesian 
attempts to introduce improper priors. In contrast, the MDL principle allows any 
sorts of “prior” assignments, whether they are determined from a part of the 
data or from no data at all so long as they produce decodable parameters. And 
always the optimized code length is regular. 

Another form of Shannon’s inequality states that the Kullback-Leibler distance 
between a “true” density function f and any modeled density g is 
nonnegative. We can sharpen the inequality if we specify only that the “true” 
density is one in a family, and we allow the modeled density to result from any 
estimation procedure; for example, we may take $g(x) = \prod_t f_{\hat\alpha(t)}(x_{t+1}|x^t)$, where 
$\hat\alpha(t) = (\hat k(x^t), \hat\theta(x^t))$ is some estimator. Then by (3.2) not only is the 
Kullback-Leibler distance nonnegative, but it must be strictly positive at least 
by an amount which reflects the uncertainty that the “true” parameter is one in 
the chosen class. This suggests the interpretation that this amount $(k/2)\log n$ 
represents the optimal model complexity. We may thus view (3.2) as providing a 
yardstick for the goodness of an estimator $\hat\alpha(x)$ by comparing the associated code 
length —log g(x) for long strings with the nonpredictive complexity, which 
represents the lower bound in Theorem 1. This is particularly appropriate 
because all the three complexities appear to reach the lower bound asymptoti- 
cally in an “almost sure” sense; see for example the discussion in Dawid (1984), 
which also gives a nice justification for the term (k/2)log n. 

It seems to us that, indeed, the three complexities can be rigorously shown to 
reach the bound $-\log f_{k,\theta}(x) + (k/2)\log n$ for almost all samples, perhaps even 
more simply than in the “mean” sense, but we cannot see any way at all to prove 
that no shorter regular code length exists for almost all samples. About the 
reachability of the bound in Theorem 1 for the semi-predictive complexity, we 
can supply a proof only in selected cases. One such is the basic gaussian 
regression problem, Rissanen (1984b). Another is the class of Markov chains, 
Rissanen (1986b). A third appears to be the important ARMA class, although at 
this writing the job is not quite finished. Here, we prove such a result with 
independence conditions. 


THEOREM 3. Let the family of densities satisfy the conditions for independence 
for each k and $\theta \in \Omega^k$, namely, $f_{k,\theta}(x) = \prod_{t=1}^{n} f_{k,\theta}(x_t)$, and let $f_{k,\theta}(x_t)$ be 
three times continuously differentiable with respect to $\theta$ in the interior of a 
compact set $\Omega^k$. Further, let the central limit theorem hold for some estimates 
$\hat\theta(x^n)$ of $\theta$ in the interior points such that the four first moments of $\sqrt{n}(\hat\theta(x^n) - \theta)$ 
converge. Then $I_{SP}(x)$, defined by Eq. (2.4), is optimal in that for all k and all 
$\theta$ in $\Omega^k$, 


(3.7)    $E_{k,\theta} I_{SP}(x^n) \le -E_{k,\theta} \log f_{k,\theta}(x^n) + \frac{k}{2}\log n + o(\log n)$, 

where $o(\log n)/\log n$ goes to zero. 
The proof is given in Appendix B. 


4. Information in experiments. In this section we wish to define in a 
formal way such frequently used intuitive notions as “information in an experi- 
ment” and “prior information.” The first of these has been defined earlier in 
terms of the Fisher information and in some contexts in terms of the 
Kullback—Leibler information, Gokhale and Kullback (1978) and Lindley (1956). 
Although both concepts do have the right flavor, their scope is limited to model 
classes with a fixed number of parameters, and a change of view is needed to 
generalize them for the model classes studied in this paper. As regards the “prior
information”, the early definition by Lindley (1956) in terms of Shannon information
is strictly restricted to the traditional way of modeling prior knowledge as a


STOCHASTIC COMPLEXITY AND MODELING 1091 


distribution about a “true” parameter value. In other words, the only source of 
uncertainty stems from sampling, which, of course, is a grossly simplified view 
and leads to absurdities of the kind that the importance of parameters and their 
estimates can be measured solely in terms of how narrow the distribution of their 
estimates is. 

Intuitively, by “useful information” in the data we mean something that we 
can learn. This certainly cannot be the “disorder,” which we measure by complex- 
ity. Rather, such information must have the nature of regular features or 
constraints that we can discover, which suggests that it could be measured in 
terms of the reduction in the total code length below a certain neutral level, 
obtainable with use of no model at all. In order to calculate such a neutral level, 
we use a universal prior density on nonnegative real numbers, which we can 
define by help of the universal distribution for the integers constructed in 
Rissanen (1983) and given by (2.3). This universal distribution provides asymp- 
totically the most efficient coding of integers. For example, it follows from the 
work in Bentley and Yao (1976) on the function log* that the length of any prefix 
code sequence on the natural numbers must exceed L*(n) — 2k*(n) infinitely 
many times, where k*(n) denotes the number of terms in log*(n). Because the 
second term grows very slowly, we conclude that the universal sequence is not far 
from the least asymptotic upper bound for all probability sequences on the 
positive integers. That such a bound cannot be reached by any sequence is not a 
serious practical defect, and we feel quite free to use L*(n) as an excellent 
representative of the universal prior. We extend this universal distribution to a 
universal density for the positive real numbers as follows: 


1 
(4.1) a(o) = <2, 


where Y is the smallest integer greater than or equal to y. This has the property 
we need: It accurately represents the complexity of any truncated real number. If 
a number y has r fractional decimal digits, the above density assigns to it the 
probability $10^{-r} q^*(y)$, and its complexity may be taken as the negative binary
logarithm of this probability. For example, the number 275.233 has the complex- 
ity 10 + log*276 = 22, and there is no way to describe this number with fewer 
binary digits while maintaining the additional requirement that the description 
can be decoded even when it is followed by other binary symbols; i.e., that the 
code is a prefix code. And it is precisely this property which we regard as the 
foremost requirement in any prior that can claim universality; for example, it 
follows that scale changes and other such transformations cannot reduce the 
complexity in a substantial way. 

The universal density at the observed sequence x is taken as $q^*(x) = \prod_{t=1}^{n} q^*(x_t)$.
Now we define

(4.2) $I^u(x) = -\log q^*(x) - I(x)$,

where I(x) is one of the complexities defined in Section 2, to be the information
in a statistical experiment, defined by x and the considered class of models. This
represents the amount of “useful” information in the data. The word “useful” is




to be taken only in the sense of extractable regular features relative to the 
considered class of models. Two strings of data might well have the same amount 
of useful information, but we might consider one of them to be more valuable for 
some practical purpose. The measure also corresponds to intuition: If the data is 
“random” in the sense that it cannot be compressed by any model, then the 
useful information is zero, or near zero, as it should be. On the other hand, if we 
have guessed the model class right, then the best model that gives the complexity 
also incorporates the maximum amount of useful information; there is nothing 
more to learn from the data with the proposed models. However, another model 
class might be found which compresses the data more, and which allows us to 
learn more. 

On intuitive grounds we would expect the information $I^u(x)$ to be positive, if
the chosen model class is reasonable. To show that kind of property we calculate
the mean of this information. If we consider classes of models which satisfy the
assumptions in Theorems 1 and 3, then by putting $g(x) = q^*(x)$ we see that the
mean information over strings of length n satisfies

$E_{k,\theta} I^u(x) = E_{k,\theta}\log\frac{f_{k,\theta}(x)}{q^*(x)} - E_{k,\theta}[I(x) + \log f_{k,\theta}(x)] \ge -\varepsilon\log n$,

under the qualifications regarding $\theta$, n, and $\varepsilon$ stated in (3.3). Here I(x) denotes
one of the complexities in Section 2. This is really the extreme case where we
have picked a worthless class of models. In any other reasonable class, $I(x)/n$
will be below $-\log q^*(x)/n$ by the order of a constant, and the mean useful
information will be strictly positive even for finite values of n.

In our general philosophy of modeling there are no data generating probabilis- 
tic systems nor “true” parameter values. Therefore, it is meaningless to measure 
the amount of prior knowledge in terms of Shannon information of the random 
variable taking values in the parameter space with some prior distribution. 
Indeed, a narrow concentrated distribution does not represent a great amount of 
prior knowledge unless the center of concentration represents a “good” parameter 
value; that is, one which selects a model from the family which captures well the 
regular features in the data. The Shannon information and hence Lindley’s 
measure of prior knowledge are independent of the most important source of such 
knowledge, namely, the location of the concentration. For these reasons we define 
the prior information differently. Let $\hat\alpha(0) = (\hat k(0), \hat\theta(0))$ be a prior estimate of the
number of the parameters and their values, respectively. As explained in Section
2, these estimates may define an empty parameter $\lambda$, which selects a special
density $f_\lambda(x_t)$ from the family. We define

(4.3) $I^0(x) = \log\frac{f_{\hat\alpha(0)}(x)}{q^*(x)} - L^*(\hat k(0))$

to represent the prior information, provided by the prior estimates. The difference

(4.4) $I^u(x) - I^0(x) = -\log f_{\hat\alpha(0)}(x) - I(x) + L^*(\hat k(0))$

clearly represents the information provided by the likelihood function alone.




What about positivity of these last two notions of information? It seems to us
that there should be no reason why just any prior parameter value ought to be
able to extract useful information from the data. In fact, we might even do worse
than what is achievable with the universal density $q^*(x)$, so that the prior
information might be negative. However, when the sample is large we should
definitely expect to learn something about it with the likelihood function so that
the mean of $I^u(x) - I^0(x)$ should be positive. This, indeed, can be shown under
the same qualifications as the positivity of the mean of $I^u(x)$. Just pick
$g(x) = f_{\hat\alpha(0)}(x)$ in Theorem 1, and apply both Theorems 1 and 3.


5. Model testing. In this concluding section we wish to illustrate the wide
scope of the notion of stochastic complexity by applying it to hypothesis testing.
Consider a set of models $\{f_\alpha(x)\}$, representing a composite null-hypothesis, and
another disjoint set $\{f_{\alpha'}(x)\}$, representing a composite alternative hypothesis. The
indices $\alpha = (k, \theta)$ and $\alpha' = (k', \theta')$ are arbitrary. Let $I(x)$ and $I'(x)$ denote the
semi-predictive stochastic complexities of x relative to the two classes of models,
respectively. Then we take the difference $D(x) = I'(x) - I(x)$ to be a test
statistic, and decide in favour of the null-hypothesis if $D(x) \ge 0$, and against it,
otherwise. Notice that in case of two simple hypotheses this test statistic
coincides with the likelihood ratio, which is known to be the most efficient test
statistic. A related test statistic was also considered in Dawid (1984).

This testing has a number of advantages over the traditional testing proce- 
dures. First, even composite hypotheses get represented by a single model, 
namely, the model which gives the complexity of the data relative to the class of 
models. Hence, there is the possibility of learning by finding better and better
model classes. Second, our test does not require knowledge of the distribution of 
the test statistic, and, hence, it is valid for small and large samples alike. It is 
clear that the validity for small samples will have to be verified by applying the 
test to a large number of actual cases, where the results are known. We have 
studied half a dozen of them, and in each the test result appears highly
reasonable. For large samples an analytic validity test can be made, and again the 
results appear to be good; see the examples below. Third, the size of the test is 
automatically adjusted to the amount of observed data. One might argue in favor 
of a subjectively selected size, but even if we can easily do the same we cannot see 
why such a thing ought to be done. After all, such a number adds nothing to the 
amount of information that can be extracted from the data, and hence it is quite
irrelevant in deciding which of the two considered hypotheses is the more likely 
explanation of the data. After this question is settled, we regard it as a separate 
matter to judge the consequences of acting on the result, which depend on issues 
and values that have nothing to do with the data. We illustrate our approach 
with two examples. 


EXAMPLE 3. Consider the null-hypothesis $p = 1/2$ against the alternative $p \ne 1/2$
in the class of Bernoulli models. We have by Example 1 in Section 2,

(5.1) $D(x) = \log\frac{(n+1)!}{n_0!\,n_1!} - n \cong \tfrac{1}{2}\log n - n(1 - H(n_1/n))$.




We accept the null-hypothesis if $D(x) \ge 0$, which means that in order for us to
accept the opposite hypothesis, the ratio $n_1/n$ must differ sufficiently from 1/2 to
overcome the “cost” of one parameter. One can show that for sample sizes up to
about 1000, this test is close to the traditional test with the confidence level 
about 0.05. For longer strings, the automatically given confidence level shrinks 
gradually to zero, as it should. Notice that in the ordinary way of doing the 
testing there is no cost associated with the number of parameters, and hence the 
opposite hypothesis would always win by a direct comparison. This is why a 
direct comparison of the likelihoods cannot be made, and one must, instead, 
introduce an artificial threshold. 
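
The test in Example 3 is easy to evaluate numerically. The following is a minimal sketch, assuming base-2 logarithms and that $n_0$ and $n_1$ are the counts of the two symbols; the function names are hypothetical and are not taken from the paper. It computes the exact statistic in (5.1) through the log-gamma function and compares it with the approximation $\tfrac{1}{2}\log n - n(1 - H(n_1/n))$.

```python
import math


def log2_factorial(m):
    """Return log2(m!) computed through the log-gamma function."""
    return math.lgamma(m + 1) / math.log(2)


def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bernoulli_test_statistic(n0, n1):
    """Exact D(x) of (5.1): log2((n+1)!/(n0! n1!)) minus the n bits used by p = 1/2."""
    n = n0 + n1
    return log2_factorial(n + 1) - log2_factorial(n0) - log2_factorial(n1) - n


def bernoulli_test_approx(n0, n1):
    """Approximation in (5.1): (1/2) log2 n - n (1 - H(n1/n))."""
    n = n0 + n1
    return 0.5 * math.log2(n) - n * (1 - binary_entropy(n1 / n))


if __name__ == "__main__":
    for n1 in (50, 60, 70):
        d = bernoulli_test_statistic(100 - n1, n1)
        verdict = "accept p = 1/2" if d >= 0 else "reject p = 1/2"
        print(n1, round(d, 2), round(bernoulli_test_approx(100 - n1, n1), 2), verdict)
```

The exact and approximate values differ only by a bounded term, so the sign of D(x), which is all the test uses, is driven by the balance between the parameter cost $\tfrac{1}{2}\log n$ and the gain $n(1 - H(n_1/n))$.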


EXAMPLE 4. As a less trivial example, consider a two-way $r \times s$ contingency
table with the (i, j)th entry being $n_{ij}$. The observations x in reality consist of a
sequence $(i_1, j_1), \ldots, (i_n, j_n)$, where $1 \le i_t \le r$, $1 \le j_t \le s$, and $n_{ij}$ denotes the
number of times (i, j) occurs in x. Let the class of models be the set of
multinomials with $\theta = \{P(i,j) = p_{ij}\}$ as the parameters. The null-hypothesis
states that the cell probabilities are determined by r + s marginal probabilities,
$p_{ij} = p_{i\cdot}\,p_{\cdot j}$, expressing independence, while the opposite hypothesis claims that no
such restrictions exist. Hence, the model defined by the null-hypothesis has
$r + s - 2$ free parameters while the competing “free” model requires $rs - 1$ free
parameters.

We compute the predictive code lengths for the sequence x with the two
hypotheses. Just as in the case of Bernoulli models in Example 1 in Section 2,
with which the first symbol in a binary sequence is assigned the probability 1/2, we
imagine that each cell in the table has the initial content of 1 under both models.
This assigns to the first cell occurrence in the string x the probability 1/rs. After
this occurrence the corresponding cell content in the table under the “free” model
is incremented by 1, which gives the table for the assignment of the probability
for the second occurrence, and the process is repeated. We see that the probability
assigned to the entire string x under the free model is

(5.2) $P_F(x) = \frac{(rs-1)!\,\prod_{i,j} n_{ij}!}{(n+rs-1)!}$.


Analogously the probability of the string x under the independence assumption
is obtained as

(5.3) $P_I(x) = \left[\frac{(rs-1)!}{(n+rs-1)!}\right]^2 \prod_i \frac{(n_{i\cdot}+s-1)!}{(s-1)!}\,\prod_j \frac{(n_{\cdot j}+r-1)!}{(r-1)!}$,


where $n_{i\cdot} = \sum_j n_{ij}$, and similarly for $n_{\cdot j}$. We then have $D(x) = \log(P_I(x)/P_F(x))$,
which with Stirling’s formula gives

(5.4) $2D(x) = (r-1)(s-1)\ln n - 2\sum_{i,j} n_{ij}\ln\frac{n\,n_{ij}}{n_{i\cdot}\,n_{\cdot j}} + O(1/n)$.

We reject the independence hypothesis if $D(x) < 0$. The sum term in (5.4)
is exactly the so-called Kullback $G^2$ measure. It has asymptotically a $\chi^2$




distribution with $(r-1)(s-1)$ degrees of freedom. Therefore, for large values of
n, our test is like the ordinary test except that the arbitrarily selected level of
confidence is replaced by the first term. However, our test is valid even for small
values of n, when the $\chi^2$ distribution for the $G^2$ measure and, hence, the
ordinary test are not justified.
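
The predictive coding of Example 4 can be sketched as follows. This is an illustration under stated assumptions, not the paper's own program: natural logarithms are used, every cell starts with content 1 as described above (so the row margins start at s and the column margins at r), and all function names are hypothetical. The sequential routine mirrors the symbol-by-symbol assignment of probabilities under independence, and the closed-form routines evaluate the resulting products of factorials.

```python
import math


def log_prob_free(table):
    """ln P_F(x): predictive probability under the unrestricted multinomial model."""
    r, s = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    cells = sum(math.lgamma(c + 1) for row in table for c in row)  # sum of ln n_ij!
    return math.lgamma(r * s) + cells - math.lgamma(n + r * s)


def log_prob_independent(table):
    """ln P_I(x): predictive probability when rows and columns are coded independently."""
    r, s = len(table), len(table[0])
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    norm = math.lgamma(r * s) - math.lgamma(n + r * s)  # ln (rs-1)!/(n+rs-1)!
    row_part = sum(math.lgamma(ni + s) - math.lgamma(s) for ni in rows)
    col_part = sum(math.lgamma(nj + r) - math.lgamma(r) for nj in cols)
    return 2.0 * norm + row_part + col_part


def sequential_log_prob_independent(seq, r, s):
    """Same quantity as log_prob_independent, accumulated one observation at a time."""
    rows, cols, total = [s] * r, [r] * s, r * s  # margins of a table filled with ones
    log_p = 0.0
    for i, j in seq:
        log_p += math.log(rows[i] / total) + math.log(cols[j] / total)
        rows[i] += 1
        cols[j] += 1
        total += 1
    return log_p


def independence_statistic(table):
    """D(x) = ln(P_I(x) / P_F(x)); independence is rejected when D(x) < 0."""
    return log_prob_independent(table) - log_prob_free(table)


if __name__ == "__main__":
    for table in ([[25, 25], [25, 25]], [[45, 5], [5, 45]]):
        d = independence_statistic(table)
        print(table, round(d, 2), "keep independence" if d >= 0 else "reject independence")
```

Because only sums of log-gamma values are involved, the statistic can be evaluated for tables with very small counts, which is the situation where the text notes that the asymptotic $\chi^2$ justification is unavailable.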


APPENDIX A 


PROOF OF THEOREM 1. We pick k and write $f_\theta$ and $E_\theta$ for the density and
expectation, respectively, defined by a parameter vector with k components.
We also denote the so-induced probability measure by $P_\theta$. For each parameter $\theta$
in $\Omega^k$ let $J_n(\theta)$ denote a closed neighborhood of radius $r_n = \log n/\sqrt{n}$ with $\theta$ as
its center. Define for the process determined by $\theta$ the set of its $\theta$-typical strings
of length n

(A.1) $Y_n(\theta) = \{x^n : \hat\theta(x^n) \in J_n(\theta)\}$,

where $\hat\theta(x)$ denotes an estimate of $\theta$ satisfying the assumptions in the theorem.
Let $P_n(\theta)$ denote the probability under $P_\theta$ of the strings that fall within $Y_n(\theta)$,
which is the same as the probability that $(\hat\theta(x^n) - \theta)\sqrt{n}$ falls within a neighborhood
of radius $\log n$ about the origin. By the assumption, this probability exceeds
$1 - \delta(n)$, where $\delta(n) \to 0$.

Before continuing, we give the gist of the proof which is really quite simple.
Let the parameter have only one component and let $\Omega$ be divided into N equal
segments $J_n(\theta_i)$, $i = 1, \ldots, N$, of length $2r_n$. (Regard the given choice for the
radius as a good guess.) Now, if a density g exists whose Kullback distance from
$f_{\theta_1}$ is short, then the probability mass given by g to the set $Y_n(\theta_1)$ must exceed a
certain amount, depending on how short a distance we demand. The same holds
for $\theta_2, \theta_3, \ldots$. But there is only the total mass of unity available, and it so
happens that g can be as close as stated only to precious few of the densities
$f_\theta$, which is what the theorem in essence states. A critical point in the proof is to
make sure that the sets of strings for which g has to distribute its mass are
indeed disjoint. The statement of the theorem is such that it is not enough to
consider the sets $Y_n(\theta_i)$, which, of course, are disjoint by their very definition, but
we have to add probabilities of sets of strings with different lengths.

We need to define another smaller set of typical sequences, namely, the set

$X_{n,t}(\theta) = \{x^{n+t} : x^{n+i} \in Y_{n+i}(\theta),\ i = 0, 1, \ldots, t\}$.

In words, this is the subset of sequences of $Y_{n+t}(\theta)$ such that not only the
sequences themselves are typical but also all their prefixes of length ranging from
n to n + t are typical as well. We need to estimate the probability $P_{n,t}(\theta)$ of this
subset. Let $Z_j(\theta)$, for j in the range $0 \le j < t$, denote the set of sequences $x^{n+t}$ in
$Y_{n+t}(\theta)$ with the property that the prefix $x^{n+j}$ of $x^{n+t}$ is the first which is not in
$Y_{n+j}(\theta)$. Hence, all the shorter prefixes $x^n, \ldots, x^{n+j-1}$ belong to
$Y_n(\theta), \ldots, Y_{n+j-1}(\theta)$, respectively. The case j = 0 means that already the first
prefix of length n does not belong to $Y_n(\theta)$. It is clear that the union of $Z_j(\theta)$ over




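j is precisely the difference $Y_{n+t}(\theta) - X_{n,t}(\theta)$. Because $\delta(n)$ is summable, we
deduce first that

(A.2) $P_\theta\{Y_{n+t}(\theta) - X_{n,t}(\theta)\} \le \sum_{i \ge n} \delta(i) = \rho(n)$,

and then

(A.3) $P_{n,t}(\theta) > 1 - \delta(n) - \rho(n)$,

where $\delta(n) + \rho(n) \to 0$.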
Let g(x) be any density function defined on the data sequences, as stated in
the theorem, and denote by $Q_{n,t}(\theta)$ the so induced probability over the set
$X_{n,t}(\theta)$. Then by the nonnegativity of the Kullback distance between the
densities $g(x)/Q_{n,t}(\theta)$ and $f_\theta(x)/P_{n,t}(\theta)$, we get

(A.4) $\int_{X_{n,t}(\theta)} f_\theta(x)\log\frac{f_\theta(x)}{g(x)}\,dx \ge P_{n,t}(\theta)\log\frac{P_{n,t}(\theta)}{Q_{n,t}(\theta)}$.

Pick a positive integer m, and let $A_m(n,t)$ be the set of $\theta$ such that the left hand
side of (A.4), denoted by $T_{n,t}(\theta)$, satisfies the inequality

(A.5) $T_{n,t}(\theta) < [1 - 1/m]\log((n+t)^{k/2})$.

We wish to calculate an upper bound for the volume $V_m(n)$ of $B_m(n) =
A_m(n,0) + A_m(n,1) + \cdots$. To this end, let $N_{n,0}$ denote the maximum number
of disjoint neighborhoods $J_n(\theta)$ that can be constructed such that their
centers $\theta$ lie in $A_m(n,0)$. Let C(n,0) denote the set of the centers. These
neighborhoods may not cover $A_m(n,0)$, because there may be points that are too
close to some of the constructed neighborhoods without being covered by any.
However, if we double the radius of each neighborhood in the maximal collection
we get a cover S(n,0) for $A_m(n,0)$. Recursively, let $N_{n,t}$ denote the maximum
number of disjoint neighborhoods $J_{n+t}(\theta)$ that can be constructed such that
their centers lie in the difference set $A_m(n,t) - S(n,t-1)$. Let C(n,t) denote
the set of the centers. This construction means, obviously, that if $\theta$ is in C(n,i)
and $\theta'$ in C(n,t), for i < t, then the distance from $\theta'$ to $J_{n+i}(\theta)$ is not smaller
than $(\log(n+i))/\sqrt{n+i}$. As above, the neighborhoods $J_{n+t}(\theta)$, for $\theta$ in C(n,t),
may not cover the difference set, but by doubling their radius the union of the
resulting expanded neighborhoods together with S(n,t-1) gives a cover S(n,t)
for $B_m(n,t) = A_m(n,0) + \cdots + A_m(n,t)$. Hence, the volume $V_m(n,t)$ of $B_m(n,t)$
is bounded by

(A.6) $V_m(n,t) \le K\left[N_{n,0}\left(\frac{\log n}{\sqrt{n}}\right)^k + \cdots + N_{n,t}\left(\frac{\log(n+t)}{\sqrt{n+t}}\right)^k\right]$,

where K is a constant. We presently derive an upper bound for the right-hand
side.



From (A.4) and (A.5) we conclude that

(A.7) $-\log Q_{n,t}(\theta) < \left[\frac{1 - 1/m}{P_{n,t}(\theta)} - \frac{\log P_{n,t}(\theta)}{\log((n+t)^{k/2})}\right]\log((n+t)^{k/2})$,

for $\theta$ in $A_m(n,t)$. By picking n large enough we can by (A.3) make $P_{n,t}(\theta)$ so
close to unity that the expression within the brackets is less than some number $\beta$,
such that $0 < \beta < 1$, uniformly in t and $\theta$. Hence,

(A.8) $Q_{n,t}(\theta) > (n+t)^{-k\beta/2}$,

which holds for $\theta$ in $A_m(n,t)$ and n larger than some number.
We wish to prove the inequality

(A.9) $1 \ge \sum_{\theta \in C(n,0)} Q_{n,0}(\theta) + \cdots + \sum_{\theta \in C(n,t)} Q_{n,t}(\theta)$,

for all t. Each sum is a probability, induced by the density g, of a set of strings
with length varying from sum to sum, and we cannot directly claim the inequality.
However, consider the following. The neighborhoods $J_n(\theta)$ for $\theta$ in
C(n,0) are disjoint by construction, which makes the corresponding $N_{n,0}$ sets
$X_{n,0}(\theta)$ disjoint. Hence, surely, the first sum in (A.9) does not exceed unity. For $\theta$
in C(n,1), consider the set of the prefixes with length n of the strings in $X_{n,1}(\theta)$.
Denote this subset of $X_{n,0}(\theta)$ by $U_{n,1}(\theta)$. The neighborhoods $J_{n+1}(\theta)$ for $\theta$ in
C(n,1) are not only disjoint from each other and from all of the neighborhoods
$J_n(\theta')$ for $\theta'$ in C(n,0), but since the distance from $\theta$ to any neighborhood $J_n(\theta')$,
$\theta'$ in C(n,0), exceeds $(\log n)/\sqrt{n}$, also none of the larger neighborhoods
$J_n(\theta)$ for $\theta$ in C(n,1) intersects any of the neighborhoods $J_n(\theta')$, for $\theta'$ in
C(n,0). This means that for $\theta$ in C(n,1) the set $X_{n,0}(\theta)$ does not intersect any of
the sets $X_{n,0}(\theta')$, $\theta'$ in C(n,0), which appear in the first sum of (A.9). Therefore,
because g is a density satisfying the compatibility conditions for a random
process,

$\sum_{\theta \in C(n,1)} Q_{n,1}(\theta) \le Q\Bigl\{\bigcup_{\theta \in C(n,1)} U_{n,1}(\theta)\Bigr\}$,

where Q denotes the probability measure defined by the density g, and (A.9)
holds for t = 1. The same arguments apply for any t, because, as we explained
above in the paragraph preceding (A.6), if $\theta$ is in C(n,i) and $\theta'$ in C(n,t),
for i < t, then the distance from $\theta'$ to $J_{n+i}(\theta)$ is not smaller than
$(\log(n+i))/\sqrt{n+i}$. Therefore, for $\theta$ in C(n,t) and $\theta'$ in C(n,i), $i \le t$ and
$\theta \ne \theta'$, the sets $X_{n,t}(\theta)$ and $X_{n,i}(\theta')$ consist of strings with length n + t and
n + i, respectively, such that their prefix sets of length n, $U_{n,t}(\theta)$ and $U_{n,i}(\theta')$,
are disjoint. This in turn means that the two sets of strings of length n,
$\bigcup_{\theta \in C(n,t)} U_{n,t}(\theta)$ and $\bigcup_{\theta' \in C(n,i)} U_{n,i}(\theta')$, are disjoint, and (A.9) follows.

We now put the various pieces together to conclude the proof. From (A.6) we
get first

$V_m(n,t) \le K\sum_{i=0}^{t} N_{n,i}\,(n+i)^{-k\beta/2}\,(n+i)^{k(\beta-1)/2}\,(\log(n+i))^k$.

Because $(\log n)/\sqrt{n}$ is eventually monotone decreasing, we also have eventually

(A.10) $V_m(n,t) \le K n^{k(\beta-1)/2}(\log n)^k \sum_{i=0}^{t} N_{n,i}\,(n+i)^{-k\beta/2} \le K n^{k(\beta-1)/2}(\log n)^k$,

where the last inequality results from (A.8) and (A.9). This holds for all t and all
sufficiently large values for n. Hence the monotone increasing sequence $V_m(n,t)$,
t = 1, 2, ..., has a limit, which is $V_m(n)$, still bounded from above by the rightmost
term in (A.10) for all sufficiently large n. Because this term converges to
zero as n grows to infinity, so does $V_m(n)$.

We have shown that the measure of the set $\limsup A_m(n,t)$ is zero, or,
equivalently, that for all m, the measure of the set of parameters $\theta$ for which all
of the inequalities

(A.11) $T_{n,0} \ge (1 - 1/m)\log(n^{k/2}), \quad T_{n,1} \ge (1 - 1/m)\log((n+1)^{k/2}), \ldots$,

hold for some value of n, is the measure of $\Omega^k$. Consider the inequality $\log y \ge
a(1 - (1/y))$, where $a = 1/\ln b$, b being the base of the logarithm. By putting
$y = f_\theta(x)/g(x)$ we get further

$\int_{\bar X_{n,t}(\theta)} f_\theta(x)\log\frac{f_\theta(x)}{g(x)}\,dx \ge -a$,

where $\bar A$ denotes the complement of the set A. Using this we get for the points $\theta$
where (A.11) hold,

$E_\theta\log\frac{f_\theta(x^{n+t})}{g(x^{n+t})} \ge T_{n,t}(\theta) - a \ge (1 - 1/m)\log((n+t)^{k/2}) - a$.

Since for each such $\theta$ these hold for some n and all t, the statement in the
theorem follows. □


APPENDIX B 


PROOF OF THEOREM 3. Write $g(\theta, w) = -\ln f_{k,\theta}(w)$ for short, where w
ranges over the reals, and denote the row vector of the first partials by
$\Delta g(\theta, w) = \partial g(\theta, w)/\partial\theta$ while $J(\theta, w)$ denotes the matrix of the double derivatives
of g. We now use natural logarithms. Then for each k the inequality

$I_{SP}(x) \le -\sum_{t=0}^{n-1}\ln f_{k,\hat\theta(t)}(x_{t+1}) + C_k$

holds by the definition of the semi-predictive complexity, where $C_k = \ln 2^{\log^* k}$.
Further, with $\delta_t = \hat\theta(t) - \theta$ we get from Taylor’s expansion of $g(\hat\theta(t), x_{t+1})$ about $\theta$

$I_{SP}(x) + \ln f_{k,\theta}(x) \le \sum_{t=0}^{n-1}\left[\Delta g(\theta, x_{t+1})\delta_t + \tfrac{1}{2}\delta_t' J(\theta, x_{t+1})\delta_t + R(\bar\theta(t), x_{t+1})\right] + C_k$,

where $\bar\theta(t)$ is a point on the line segment connecting $\hat\theta(t)$ and $\theta$, and the
remainder term $R(\bar\theta(t), x_{t+1})$ is proportional to a sum of terms of type

$\frac{\partial^3 g(\theta, x_{t+1})}{\partial\theta_i\,\partial\theta_j\,\partial\theta_l}(\hat\theta_i(t) - \theta_i)(\hat\theta_j(t) - \theta_j)(\hat\theta_l(t) - \theta_l)$,

all evaluated at $\bar\theta(t)$. These triple partials are uniformly bounded in the compact
subset $\Omega^k$. Since the moments of $\delta_t$ converge, we get by Schwarz’s inequality that
$E_{k,\theta}|R(\bar\theta(t), x_{t+1})| \le K t^{-3/2}$, for some constant K. Further,

$E_{k,\theta}(\delta_t\delta_t') = (tJ)^{-1} + t^{-1}\rho_t$,

where $E_{k,\theta}J(\theta, x_{t+1}) = J$ and $\rho_t \to 0$. Hence, by using the independence of $x_{t+1}$
and $\hat\theta(t)$ together with the fact that the mean of the latter is 0, we get

$n^{-1}E_{k,\theta}[I_{SP}(x) + \ln f_{k,\theta}(x)] \le \frac{k}{2n}\sum_{t=1}^{n}\frac{1}{t} + r_n$,

where

$|r_n| \le \frac{1}{2n}\sum_{t=1}^{n}\left(\frac{\|\rho_t\|}{t} + K t^{-3/2}\right) + \frac{C_k}{n}$.

The sum of the harmonic series is $\ln n + \nu_n$, where $\nu_n/\ln n \to 0$. With a little
struggling one sees that also $n r_n/\ln n \to 0$, which completes the proof. □
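
The harmonic-sum step at the end of the proof can be checked numerically. The sketch below is only an illustration with a hypothetical helper name: it evaluates $\nu_n = \sum_{t=1}^{n} 1/t - \ln n$ and the ratio $\nu_n/\ln n$ for a few values of n. The remainder stays bounded (it approaches Euler's constant, roughly 0.5772), so the ratio shrinks as n grows and the leading term of the bound is $(k/2)\ln n$.

```python
import math


def harmonic(n):
    """Partial sum 1 + 1/2 + ... + 1/n of the harmonic series."""
    return sum(1.0 / t for t in range(1, n + 1))


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        nu = harmonic(n) - math.log(n)  # remainder nu_n in the proof
        print(n, round(nu, 4), round(nu / math.log(n), 4))
```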


Acknowledgment. An anonymous referee deserves credit for carefully read-
ing several versions of this paper and providing many helpful suggestions. 


REFERENCES 


AKAIKE, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat.
Control AC-19 716-723.

BENTLEY, J. L. and YAO, A. C. (1976). An almost optimal algorithm for unbounded searching.
Inform. Process. Lett. 5 82-87.

CHAITIN, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc.
Comput. Mach. 22 329-340.

DAWID, A. P. (1984). Present position and potential developments: some personal views, statistical
theory, the prequential approach. J. Roy. Statist. Soc. Ser. A 147 273-292.

GEISSER, S. and EDDY, W. (1979). A predictive approach to model selection. J. Amer. Statist.
Assoc. 74 153-160.

GOKHALE, D. V. and KULLBACK, S. (1978). The Information in Contingency Tables. Dekker, New
York.

HJORTH, U. (1982). Model selection and forward validation. Scand. J. Statist. 9 95-105.

KOLMOGOROV, A. N. (1965). Three approaches to the quantitative definition of information.
Problems Inform. Transmission 1 4-7.

LINDLEY, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math.
Statist. 27 986-1005.

RISSANEN, J. (1978). Modeling by shortest data description. Automatica 14 465-471.

RISSANEN, J. (1983). A universal prior for integers and estimation by minimum description length.
Ann. Statist. 11 416-431.

RISSANEN, J. (1984a). Universal coding, information, prediction, and estimation. IEEE Trans.
Inform. Theory IT-30 629-636.

RISSANEN, J. (1984b). A predictive least squares principle. IMA J. of Math. Control and Information.
To appear.

RISSANEN, J. (1985). Minimum description length principle. In Encyclopedia of Statistical Sciences
(S. Kotz and N. L. Johnson, eds.) 5 523-527. Wiley, New York.

RISSANEN, J. (1986a). Order estimation by accumulated prediction errors. In Essays in Time Series
and Allied Processes (J. Gani and M. B. Priestley, eds.) 55-61. Applied Probability Trust,
Sheffield, England.

RISSANEN, J. (1986b). Complexity of strings in the class of Markov sources. IEEE Trans. Inform.
Theory IT-32 526-532.

SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464.

SOLOMONOFF, R. J. (1964). A formal theory of inductive inference. Part I. Inform. and Control 7
1-22; Part II. Inform. and Control 7 224-254.

STONE, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist.
Soc. Ser. B 36 111-147.

STONE, M. (1977a). Asymptotics for and against cross-validation. Biometrika 64 29-35.

STONE, M. (1977b). An asymptotic equivalence of choice of model by cross-validation and Akaike's
criterion. J. Roy. Statist. Soc. Ser. B 39 44-47.


IBM ALMADEN RESEARCH CENTER 
650 Harry ROAD 
SAN JOSE, CALIFORNIA 95120 


The Annals of Statistics 
1986, Vol. 14, No 3, 1101-1112 


ASYMPTOTIC OPTIMALITY OF $C_L$ AND GENERALIZED
CROSS-VALIDATION IN RIDGE REGRESSION WITH
APPLICATION TO SPLINE SMOOTHING¹


By KER-CHAU LI
University of California, Los Angeles 


The asymptotic optimality of Mallows’ $C_L$ and generalized cross-validation
is demonstrated in the setting of ridge regression. An application is made
to spline smoothing in nonparametric regression. A counterexample is given to
help understand why sometimes GCV may not be asymptotically optimal.
The coefficient of variation for the eigenvalues of the information matrix 
must be large in order to guarantee the optimality of GCV. The proof is based 
on the connection between GCV and Stein’s unbiased risk estimate. 


1. Introduction. Suppose that we observe n independent normal random
variables $y_i$, $i = 1, 2, \ldots, n$, each associated with $p_n$ explanatory variables
$x_{i1}, \ldots, x_{ip_n}$. In ridge regression, we may estimate the mean $\mu_n = (\mu_1, \ldots, \mu_n)'$
of $y_n = (y_1, \ldots, y_n)'$ by $\hat\mu_n(h) = X_n(X_n'X_n + hI)^{-1}X_n'y_n$, where $X_n$ is the $n \times p_n$
design matrix $(x_{ij})$. The choice of ridge parameter h is crucial and many
procedures have been proposed. Two of them, namely $C_L$ (Mallows, 1973) and
GCV (Craven and Wahba, 1979), will be studied here.

Let $\sigma^2$ be the common variance of $y_i$ and put $M_n(h) = X_n(X_n'X_n + hI)^{-1}X_n'$.
$C_L$ selects h by minimizing

(1.1) $n^{-1}\|y_n - \hat\mu_n(h)\|^2 + 2\sigma^2 n^{-1}\operatorname{tr} M_n(h)$.

Subtracting $\sigma^2$, (1.1) gives an unbiased estimate of the risk $R_n(h) =
E\,n^{-1}\|\mu_n - \hat\mu_n(h)\|^2$. If $\sigma^2$ is unknown, then it has to be replaced by an estimate
$\hat\sigma^2$. Thus the stability of $\hat\sigma^2$ may influence the performance of $C_L$. Generalized
cross-validation (GCV) does not need $\sigma^2$. It selects h by minimizing

(1.2) $\frac{n^{-1}\|y_n - \hat\mu_n(h)\|^2}{(1 - n^{-1}\operatorname{tr} M_n(h))^2}$.

Let $\hat h_M$ and $\hat h_G$ denote the h selected by $C_L$ and GCV, respectively. Put
$L_n(h) = n^{-1}\|\mu_n - \hat\mu_n(h)\|^2$. We shall show that they are asymptotically optimal
(a.o.) in the sense that as $n \to \infty$,

(1.3) $\frac{L_n(\hat h)}{\inf_{h \ge 0} L_n(h)} \to 1$, in probability,

where $\hat h = \hat h_M, \hat h_G$.


Received October 1984; revised August 1985. 

¹This work was supported by the National Science Foundation under Grant No. MCS-8200631.

AMS 1980 subject classifications. 62G05, 62G99. 

Key words and phrases. $C_L$, generalized cross-validation, ridge regression, smoothing splines, Stein
estimates, Stein’s unbiased risk estimates.




1102 K.-C. LI 


The following is the only condition needed for $C_L$ to be a.o.:

(A.1) $\inf_{h \ge 0} nR_n(h) \to \infty$.

There are no explicit restrictions on the sequence of design matrices $X_n$. But an
implicit one more or less implied by (A.1) is that $p_n$ tends to $\infty$ as $n \to \infty$.
Without (A.1), it seems that no selection procedure can be a.o.; otherwise the
resulting estimates may possess unattainably small risk.

The result for GCV requires, in addition to (A.1), a certain condition on the
eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{p_n} \ge 0$ of the information matrix $X_n'X_n$. Roughly
speaking, the coefficient of variation for the $\lambda_i$’s should tend to infinity as
$n \to \infty$ (see (A.2) of Section 3) and hence make the problem ill-posed. We also
provide an example to show that if the spread of these eigenvalues does not tend
to infinity, then GCV may not be a.o.

As an application, we consider the spline smoothing problem. Suppose $\mu_i =
f(x_i)$, with the unknown function $f \in W_2^k[0,1] = \{f: f$ has absolutely continuous
derivatives $f', \ldots, f^{(k-1)}$ and $\int_0^1 f^{(k)}(x)^2\,dx < \infty\}$ and $x_i \in [0,1]$. The
smoothing spline estimate $\hat f_h$ of f is the solution of

$\min_{f \in W_2^k[0,1]}\ n^{-1}\sum_{i=1}^{n}(y_i - f(x_i))^2 + h\int_0^1 f^{(k)}(x)^2\,dx$.

It is well known that $\hat\mu_n(h) = (\hat f_h(x_1), \ldots, \hat f_h(x_n))'$ takes the form of ridge
regression with the first k eigenvalues being $+\infty$ (see, e.g., Li (1985)). We shall
show that $C_L$ and GCV are both a.o. if f is not a polynomial of degree k - 1 or
less.

For spline smoothing, there have been some results in the literature that are
related to the a.o. property of GCV, mostly due to Wahba and her collaborators.
Let $h_G$ be the minimizer of the expectation of (1.2) over $h \ge 0$. It was shown in
Craven and Wahba (1979) that

(1.4) $R_n(h_G)/\inf_{h \ge 0} R_n(h) \to 1$.

See Wahba (1985) for more information. However, it is clear that the results of
this type do not necessarily lead to the a.o. of (1.3). For example, if f is a
polynomial of degree k - 1, then (1.3) cannot hold for any selection procedure;
but it is easy to see that (1.4) holds. The big gap between (1.3) and (1.4) was
closed significantly by Speckman (1982) who established (1.3) under the assumption
that $\hat h_G$ is selected by minimizing (1.2) over h in some closed interval that
converges to 0 in some fashion as n tends to infinity. Speckman’s result was
derived (by Cox (1983)) without the normality assumption. Erdal (1983) and
Golub, Heath, and Wahba (1979) discussed the properties of GCV for general
ridge regression.

Our method in proving (1.3) for GCV is based on the connection between
Stein’s unbiased risk estimate (Stein, 1981) and GCV. This connection has been
used to demonstrate the consistency of GCV in many settings (Li, 1985). The a.o.
of $C_L$ and GCV in the discrete index set case, such as model-selection or nearest
neighbor nonparametric regression, was established in Li (1984).


RIDGE REGRESSION 1103 
2. Mallows' $C_L$. In this section we shall prove the following theorem.
THEOREM 1. Under (A.1), (1.3) holds for h = hy. 


We shall assume that $X_n$ is diagonal, i.e., $x_{ij} = 0$ for $i \ne j$ and
$x_{ii} = \lambda_i^{1/2}$. This is without loss of generality because after a suitable
orthogonal transformation we can reduce any $X_n$ to a diagonal form without changing
the error distribution (due to normality). Now $M_n(h)$ is simply an $n \times n$
diagonal matrix with $\lambda_i (h + \lambda_i)^{-1}$ as the $i$th diagonal element. Here
we put $\lambda_i = 0$ for $i > p_n$. Note that $\lambda_i$ may depend on $n$.

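Let $e_n = (e_1, e_2, \ldots, e_n)' = y_n - \mu_n$ and $A_n(h) = I - M_n(h)$. Clearly,

$$n^{-1}\|y_n - \hat\mu_n(h)\|^2 + 2\sigma^2 n^{-1}\operatorname{tr} M_n(h)
 = n^{-1}\|e_n\|^2 + L_n(h) + 2n^{-1}\langle e_n, A_n(h)\mu_n\rangle
 + 2n^{-1}\bigl(\sigma^2 \operatorname{tr} M_n(h) - \langle e_n, M_n(h)e_n\rangle\bigr).$$

Therefore, it is enough to show that in probability,

(2.1) $$\sup_{h \ge 0} |n^{-1}\langle e_n, A_n(h)\mu_n\rangle| / R_n(h) \to 0,$$

(2.2) $$\sup_{h \ge 0} n^{-1}|\sigma^2 \operatorname{tr} M_n(h) - \langle e_n, M_n(h)e_n\rangle| / R_n(h) \to 0,$$

and

(2.3) $$\sup_{h \ge 0} |L_n(h)/R_n(h) - 1| \to 0.$$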
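The decomposition behind (2.1)-(2.3) is an exact algebraic identity in the diagonal model, so it can be checked numerically. The following is a minimal sketch, not part of the original argument; the function name `cl_identity` and the random test data are illustrative assumptions, and the loss is taken to be $L_n(h) = n^{-1}\|\mu_n - \hat\mu_n(h)\|^2$.

```python
import random


def cl_identity(lam, mu, e, h, sigma2):
    """Return (lhs, rhs) of the C_L decomposition in the diagonal ridge model.

    lam: eigenvalues lambda_i, mu: true means, e: errors, h: ridge parameter.
    """
    n = len(lam)
    m = [l / (l + h) for l in lam]                    # diagonal of M_n(h)
    y = [mu_i + e_i for mu_i, e_i in zip(mu, e)]      # observations
    mu_hat = [m_i * y_i for m_i, y_i in zip(m, y)]    # ridge estimate M_n(h) y_n
    tr_m = sum(m)
    lhs = sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, mu_hat)) / n + 2 * sigma2 * tr_m / n
    loss = sum((mu_i - f_i) ** 2 for mu_i, f_i in zip(mu, mu_hat)) / n       # L_n(h)
    cross = sum(e_i * (1 - m_i) * mu_i for e_i, m_i, mu_i in zip(e, m, mu))  # <e, A mu>
    quad = sum(m_i * e_i ** 2 for m_i, e_i in zip(m, e))                     # <e, M e>
    rhs = (sum(e_i ** 2 for e_i in e) / n + loss
           + 2 * cross / n + 2 * (sigma2 * tr_m - quad) / n)
    return lhs, rhs


if __name__ == "__main__":
    rng = random.Random(1)
    lam = [rng.uniform(0.0, 5.0) for _ in range(20)]
    mu = [rng.gauss(0.0, 2.0) for _ in range(20)]
    e = [rng.gauss(0.0, 1.0) for _ in range(20)]
    print(cl_identity(lam, mu, e, h=0.7, sigma2=1.0))
```

The two returned numbers agree up to rounding for any inputs, which is why the proof only needs to control the two cross terms and the ratio $L_n(h)/R_n(h)$.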
The following useful inequality is recalled from Li (1985). It first appeared in
Speckman (1982, 1985).


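LEMMA 2.1. Assume that $W_i$, $i = 1, 2, \ldots, n$, are independent random variables
with means 0 and finite second moments. Then for any $\delta > 0$, we have

$$P\Bigl\{\sup_{0 \le c_n \le \cdots \le c_1 \le a} \Bigl|\sum_{i=1}^n c_i W_i\Bigr| \ge \delta\Bigr\}
 \le a^2 \delta^{-2} E\Bigl(\sum_{i=1}^n W_i\Bigr)^2.$$

If the $W_i$'s have finite fourth moments, then

$$P\Bigl\{\sup_{0 \le c_n \le \cdots \le c_1 \le a} \Bigl|\sum_{i=1}^n c_i W_i\Bigr| \ge \delta\Bigr\}
 \le a^4 \delta^{-4} E\Bigl(\sum_{i=1}^n W_i\Bigr)^4.$$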
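One standard route to an inequality of this kind (an assumption here, since the text only cites Li (1985) and Speckman) is summation by parts: for weights $0 \le c_n \le \cdots \le c_1 \le a$, the weighted sum is bounded by $a$ times the largest partial sum of the $W_i$, after which a maximal inequality for partial sums applies. The sketch below checks only the deterministic summation-by-parts step; `weighted_sum_bound` is a hypothetical helper name.

```python
import random


def weighted_sum_bound(w, c, a):
    """Return (|sum_i c_i w_i|, a * max_j |w_1 + ... + w_j|).

    Requires nonincreasing weights with 0 <= c_n <= ... <= c_1 <= a.
    Summation by parts guarantees first <= second.
    """
    if not (all(c[i] >= c[i + 1] for i in range(len(c) - 1)) and c[-1] >= 0 and c[0] <= a):
        raise ValueError("weights must satisfy 0 <= c_n <= ... <= c_1 <= a")
    partial, max_partial = 0.0, 0.0
    for w_i in w:
        partial += w_i
        max_partial = max(max_partial, abs(partial))   # running max of |partial sums|
    weighted = abs(sum(c_i * w_i for c_i, w_i in zip(c, w)))
    return weighted, a * max_partial


if __name__ == "__main__":
    rng = random.Random(2)
    w = [rng.gauss(0.0, 1.0) for _ in range(10)]
    c = sorted((rng.uniform(0.0, 3.0) for _ in range(10)), reverse=True)
    print(weighted_sum_bound(w, c, a=3.0))
```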
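We begin to prove (2.1). Put $B_n(h) = \sum_{i=1}^n \mu_i^2 (\lambda_i + h)^{-2}$. Since
$nR_n(h) \ge h^2 B_n(h)$, it is enough to show that

(2.1') $$\sup_{h \ge 0} \Bigl|\sum_{i=1}^n e_i \mu_i (\lambda_i + h)^{-1}\Bigr| \Big/ B_n(h)^{1/2} (nR_n(h))^{1/2} \to 0.$$

For each $n$, let $I_1(j) = \{1, 2, \ldots, j\}$ and $I_2(j) = \{j + 1, \ldots, n\}$. Define
$\bar n$ to be the largest $i$ such that $\lambda_i \ne 0$. Put $Q_n = \inf_{h \ge 0} nR_n(h)$
and $V_n(h) = \sum_{i=1}^n \lambda_i^2 (\lambda_i + h)^{-2}$. Clearly (2.1') will hold if we
can show that for any natural number $k$ and for $l = 1, 2$,

(2.4) $$\sup_{h \ge \lambda_k} \Bigl|\sum_{i \in I_l(k)} e_i \mu_i (\lambda_i + h)^{-1}\Bigr| \Big/ B_n(h)^{1/2} Q_n^{1/2} \to 0,$$

and that for any $\varepsilon > 0$, there exist constants $c_1(\varepsilon), c_2(\varepsilon)$ such that

(2.5) $$P\Bigl\{\sup_{k \le j \le \bar n}\ \sup_{\lambda_{j+1} \le h \le \lambda_j}
 \Bigl|\sum_{i \in I_l(j)} e_i \mu_i (\lambda_i + h)^{-1}\Bigr| \Big/ B_n(h)^{1/2} V_n(h)^{1/2} \ge \varepsilon\Bigr\}
 \le c_l(\varepsilon) \sum_{j=k}^{\infty} j^{-2}$$

for $l = 1, 2$.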


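PROOF OF (2.4). When $l = 1$, the left side of (2.4) does not exceed

$$Q_n^{-1/2} \max_{1 \le i \le k} |e_i| \sum_{i=1}^k \sup_{h \ge \lambda_k} |\mu_i| (\lambda_i + h)^{-1} / B_n(h)^{1/2}
 \le Q_n^{-1/2}\, k \max_{1 \le i \le k} |e_i|,$$

which tends to 0 because of (A.1), as desired.
When $l = 2$, it suffices to show that for any $\varepsilon > 0$,

(2.6) $$P\Bigl\{\sup_{h \ge \lambda_k} \Bigl|\sum_{i=k+1}^n e_i \mu_i h (\lambda_i + h)^{-1}\Bigr| \Big/ h B_n(h)^{1/2} Q_n^{1/2} \ge \varepsilon\Bigr\} \to 0.$$

Since $h^2 B_n(h)$ is nondecreasing in $h$, the left side of (2.6) does not exceed

$$P\Bigl\{\sup_{h \ge \lambda_k} \Bigl|\sum_{i=k+1}^n e_i \mu_i h (\lambda_i + h)^{-1}\Bigr| \ge \varepsilon \lambda_k B_n(\lambda_k)^{1/2} Q_n^{1/2}\Bigr\}$$
$$\le P\Bigl\{\Bigl|\sum_{i=k+1}^n e_i \mu_i \lambda_k (\lambda_i + \lambda_k)^{-1}\Bigr| \ge \tfrac12 \varepsilon \lambda_k B_n(\lambda_k)^{1/2} Q_n^{1/2}\Bigr\}$$
$$\quad + P\Bigl\{\sup_{h \ge \lambda_k} \Bigl|\sum_{i=k+1}^n e_i \mu_i \bigl(h(\lambda_i + h)^{-1} - \lambda_k(\lambda_i + \lambda_k)^{-1}\bigr)\Bigr| \ge \tfrac12 \varepsilon \lambda_k B_n(\lambda_k)^{1/2} Q_n^{1/2}\Bigr\}.$$

By Chebyshev's inequality, the first term of the last expression does not exceed

$$\bigl(\tfrac12 \varepsilon B_n(\lambda_k)^{1/2} Q_n^{1/2}\bigr)^{-2} E\Bigl|\sum_{i=k+1}^n e_i \mu_i (\lambda_i + \lambda_k)^{-1}\Bigr|^2
 \le 4\varepsilon^{-2} Q_n^{-1} \sigma^2 \to 0,$$

because of (A.1). The second term is also no greater than $4\varepsilon^{-2} Q_n^{-1} \sigma^2$ due to
Lemma 2.1. To see this, observe that

$$h(\lambda_i + h)^{-1} - \lambda_k(\lambda_i + \lambda_k)^{-1} = (\lambda_i + \lambda_k)^{-1} (h - \lambda_k) \lambda_i (\lambda_i + h)^{-1}$$

and that for $i \ge k + 1$ and $h \ge \lambda_k$, $(h - \lambda_k)\lambda_i(\lambda_i + h)^{-1}$ is nonincreasing in $i$
and is no greater than $\lambda_k$. Now set $W_i = e_i \mu_i (\lambda_i + \lambda_k)^{-1}$, $a = \lambda_k$, and
$\delta = (\varepsilon/2)\lambda_k B_n(\lambda_k)^{1/2} Q_n^{1/2}$ in Lemma 2.1 to yield the desired bound. Therefore (2.4) is
proved. □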


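PROOF OF (2.5). For $l = 1$, since $B_n(h)$ and $V_n(h)$ are both nonincreasing in
$h$, the left side of (2.5) does not exceed

$$\sum_{j=k}^{\bar n} P\Bigl\{\sup_{\lambda_{j+1} \le h \le \lambda_j} \Bigl|\sum_{i=1}^j e_i \mu_i (\lambda_i + h)^{-1}\Bigr| \ge \varepsilon B_n(\lambda_j)^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}$$
$$\le \sum_{j=k}^{\bar n} P\Bigl\{\Bigl|\sum_{i=1}^j e_i \mu_i (\lambda_i + \lambda_j)^{-1}\Bigr| \ge \tfrac12 \varepsilon B_n(\lambda_j)^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}$$
$$\quad + \sum_{j=k}^{\bar n} P\Bigl\{\sup_{\lambda_{j+1} \le h \le \lambda_j} \Bigl|\sum_{i=1}^j e_i \mu_i \bigl((\lambda_i + h)^{-1} - (\lambda_i + \lambda_j)^{-1}\bigr)\Bigr| \ge \tfrac12 \varepsilon B_n(\lambda_j)^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}.$$

By Chebyshev's inequality, the first term of the last expression does not exceed

(2.7) $$\sum_{j=k}^{\bar n} \bigl(\tfrac12 \varepsilon B_n(\lambda_j)^{1/2} V_n(\lambda_j)^{1/2}\bigr)^{-4} E\Bigl|\sum_{i=1}^j e_i \mu_i (\lambda_i + \lambda_j)^{-1}\Bigr|^4.$$

The second term is also no greater than (2.7). To see this, observe that
$(\lambda_i + h)^{-1} - (\lambda_i + \lambda_j)^{-1} = (\lambda_i + \lambda_j)^{-1} (\lambda_j - h)(\lambda_i + h)^{-1}$ and that for
$\lambda_{j+1} \le h \le \lambda_j$ and $i \le j$, $(\lambda_j - h)(\lambda_i + h)^{-1}$ is nondecreasing in $i$ and is
no greater than 1. Now in Lemma 2.1 set $W_i = e_i \mu_i (\lambda_i + \lambda_j)^{-1}$, $a = 1$, and
$\delta = \tfrac12 \varepsilon B_n(\lambda_j)^{1/2} V_n(\lambda_j)^{1/2}$ to yield the desired bound. Now since

$$E\Bigl|\sum_{i=1}^j e_i \mu_i (\lambda_i + \lambda_j)^{-1}\Bigr|^4 \le C \Bigl\{\sum_{i=1}^j \mu_i^2 (\lambda_i + \lambda_j)^{-2}\Bigr\}^2$$

for some constant $C$, (2.7) does not exceed $16 C \varepsilon^{-4} \sum_{j=k}^{\bar n} V_n(\lambda_j)^{-2}$. Finally, it is
clear that for $\lambda_j \ne 0$, $V_n(\lambda_j) = \sum_{i=1}^n \lambda_i^2 (\lambda_i + \lambda_j)^{-2} \ge j/4$. Thus (2.5) is established
for $l = 1$.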
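Turning to the case $l = 2$, since $h^2 B_n(h)$ is nondecreasing in $h$, the left side of
(2.5) does not exceed

$$\sum_{j=k}^{\bar n} P\Bigl\{\sup_{\lambda_{j+1} \le h \le \lambda_j} \Bigl|\sum_{i=j+1}^n e_i \mu_i h (\lambda_i + h)^{-1}\Bigr| \ge \varepsilon \lambda_{j+1} B_n(\lambda_{j+1})^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}$$
$$\le \sum_{j=k}^{\bar n} P\Bigl\{\Bigl|\sum_{i=j+1}^n e_i \mu_i \lambda_{j+1} (\lambda_i + \lambda_{j+1})^{-1}\Bigr| \ge \tfrac12 \varepsilon \lambda_{j+1} B_n(\lambda_{j+1})^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}$$
$$\quad + \sum_{j=k}^{\bar n} P\Bigl\{\sup_{\lambda_{j+1} \le h \le \lambda_j} \Bigl|\sum_{i=j+1}^n e_i \mu_i \bigl(h(\lambda_i + h)^{-1} - \lambda_{j+1}(\lambda_i + \lambda_{j+1})^{-1}\bigr)\Bigr| \ge \tfrac12 \varepsilon \lambda_{j+1} B_n(\lambda_{j+1})^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}.$$

As before, using Chebyshev's inequality for the first term and Lemma 2.1
for the second term, we may obtain the desired bound. Note that when using
Lemma 2.1, we observe that
$h(\lambda_i + h)^{-1} - \lambda_{j+1}(\lambda_i + \lambda_{j+1})^{-1} = (\lambda_i + \lambda_{j+1})^{-1} \lambda_i (h - \lambda_{j+1})(\lambda_i + h)^{-1}$
and set $W_i = e_i \mu_i (\lambda_i + \lambda_{j+1})^{-1}$, $a = \lambda_{j+1}$, and
$\delta = \tfrac12 \varepsilon \lambda_{j+1} B_n(\lambda_{j+1})^{1/2} V_n(\lambda_j)^{1/2}$. The details are omitted. This completes the
proof of (2.5). Hence (2.1) is established. □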


To prove (2.2), it suffices to show that

(2.2') $$\sup_{h \ge 0} \Bigl|\sum_{i=1}^n (\sigma^2 - e_i^2) \lambda_i (\lambda_i + h)^{-1}\Bigr| \Big/ V_n(h)^{1/2} (nR_n(h))^{1/2} \to 0.$$

Now compare (2.2') with (2.1'). By the correspondence of $\sigma^2 - e_i^2$ to $e_i$, $\lambda_i$ to
$\mu_i$, and $V_n(h)$ to $B_n(h)$, it is clear that (2.2') holds by similar arguments.
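It remains to establish (2.3). Clearly we need only to prove that

(2.8) $$\sup_{h \ge 0} \Bigl|\sum_{i=1}^n e_i \mu_i \lambda_i (\lambda_i + h)^{-2}\Bigr| \Big/ B_n(h)^{1/2} (nR_n(h))^{1/2} \to 0$$

and that

(2.9) $$\sup_{h \ge 0} \Bigl|\sum_{i=1}^n (\sigma^2 - e_i^2) \lambda_i^2 (\lambda_i + h)^{-2}\Bigr| \Big/ V_n(h)^{1/2} (nR_n(h))^{1/2} \to 0.$$

The proof of (2.8) will be similar to that of (2.1'). First it is enough to show
that for any fixed natural number $k$, and for $l = 1, 2$,

(2.10) $$\sup_{h \ge \lambda_k} \Bigl|\sum_{i \in I_l(k)} e_i \mu_i \lambda_i (\lambda_i + h)^{-2}\Bigr| \Big/ B_n(h)^{1/2} Q_n^{1/2} \to 0,$$

and that for any $\varepsilon > 0$, there exist constants $c_1(\varepsilon)$ and $c_2(\varepsilon)$ such that for $l = 1, 2$,

(2.11) $$P\Bigl\{\sup_{k \le j \le \bar n}\ \sup_{\lambda_{j+1} \le h \le \lambda_j}
 \Bigl|\sum_{i \in I_l(j)} e_i \mu_i \lambda_i (\lambda_i + h)^{-2}\Bigr| \Big/ B_n(h)^{1/2} V_n(h)^{1/2} \ge \varepsilon\Bigr\}
 \le c_l(\varepsilon) \sum_{j=k}^{\infty} j^{-2}.$$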

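PROOF OF (2.10). For $l = 1$, since $\lambda_i/(\lambda_i + h) \le 1$, the proof is exactly the
same as in (2.4). For $l = 2$, the analogue of (2.6) is

(2.12) $$P\Bigl\{\sup_{h \ge \lambda_k} \Bigl|\sum_{i=k+1}^n e_i \mu_i h \lambda_i (\lambda_i + h)^{-2}\Bigr| \Big/ h B_n(h)^{1/2} Q_n^{1/2} \ge \varepsilon\Bigr\} \to 0.$$

Now the left side of (2.12) does not exceed

$$P\Bigl\{\sup_{h \ge \lambda_k} \Bigl|\sum_{i=k+1}^n e_i \mu_i h \lambda_i (\lambda_i + h)^{-2}\Bigr| \ge \varepsilon \lambda_k B_n(\lambda_k)^{1/2} Q_n^{1/2}\Bigr\}$$
$$\le P\Bigl\{\Bigl|\sum_{i=k+1}^n e_i \mu_i \lambda_k \lambda_i (\lambda_i + \lambda_k)^{-2}\Bigr| \ge \tfrac12 \varepsilon \lambda_k B_n(\lambda_k)^{1/2} Q_n^{1/2}\Bigr\}$$
$$\quad + P\Bigl\{\sup_{h \ge \lambda_k} \Bigl|\sum_{i=k+1}^n e_i \mu_i \lambda_i \bigl(h(\lambda_i + h)^{-2} - \lambda_k(\lambda_i + \lambda_k)^{-2}\bigr)\Bigr| \ge \tfrac12 \varepsilon \lambda_k B_n(\lambda_k)^{1/2} Q_n^{1/2}\Bigr\}.$$

Now by Chebyshev's inequality and noting that $\lambda_i/(\lambda_i + \lambda_k) \le 1$, the first term
does not exceed $4\varepsilon^{-2} Q_n^{-1} \sigma^2 \to 0$. The second term can also be shown to be no
greater than $4\varepsilon^{-2} Q_n^{-1} \sigma^2$ due to Lemma 2.1. Here we observe that

$$h(\lambda_i + h)^{-2} - \lambda_k(\lambda_i + \lambda_k)^{-2} = (\lambda_i + \lambda_k)^{-2} (h - \lambda_k)(\lambda_i^2 - h\lambda_k)(\lambda_i + h)^{-2}$$

and that for $i \ge k + 1$ and $h \ge \lambda_k$, $(h - \lambda_k)(h\lambda_k - \lambda_i^2)(\lambda_i + h)^{-2}$ is nondecreasing in $i$ and
is no greater than $\lambda_k$. Thus setting $W_i = -e_i \mu_i \lambda_i (\lambda_i + \lambda_k)^{-2}$ and $a = \lambda_k$ in
Lemma 2.1 we obtain the upper bound
$4\lambda_k^2 \sigma^2 \sum_{i=k+1}^n \mu_i^2 \lambda_i^2 (\lambda_i + \lambda_k)^{-4} / \varepsilon^2 \lambda_k^2 B_n(\lambda_k) Q_n$,
which is no greater than $4\sigma^2 \varepsilon^{-2} Q_n^{-1}$ as desired. This
completes the proof of (2.10). □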


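PROOF OF (2.11). For $l = 1$, the left side of (2.11) does not exceed

$$\sum_{j=k}^{\bar n} P\Bigl\{\Bigl|\sum_{i=1}^j e_i \mu_i \lambda_i (\lambda_i + \lambda_j)^{-2}\Bigr| \ge \tfrac12 \varepsilon B_n(\lambda_j)^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}$$
$$\quad + \sum_{j=k}^{\bar n} P\Bigl\{\sup_{\lambda_{j+1} \le h \le \lambda_j} \Bigl|\sum_{i=1}^j e_i \mu_i \lambda_i \bigl((\lambda_i + h)^{-2} - (\lambda_i + \lambda_j)^{-2}\bigr)\Bigr| \ge \tfrac12 \varepsilon B_n(\lambda_j)^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}.$$

By Chebyshev's inequality and Lemma 2.1 again, both terms in the above
expression are bounded by some constant times (2.7). Here we observe that

$$(\lambda_i + h)^{-2} - (\lambda_i + \lambda_j)^{-2} = (\lambda_i + \lambda_j)^{-2} (\lambda_j - h)(2\lambda_i + \lambda_j + h)(\lambda_i + h)^{-2}$$

and that for $i \le j$ and $h \le \lambda_j$, $(\lambda_j - h)(2\lambda_i + \lambda_j + h)(\lambda_i + h)^{-2}$ is nondecreasing
in $i$ and is not greater than 3. Put $W_i = e_i \mu_i \lambda_i (\lambda_i + \lambda_j)^{-2}$ and $a = 3$ to yield
the desired bound.

Turning to the case $l = 2$, the left side of (2.11) does not exceed

$$\sum_{j=k}^{\bar n} P\Bigl\{\Bigl|\sum_{i=j+1}^n e_i \mu_i \lambda_{j+1} \lambda_i (\lambda_i + \lambda_{j+1})^{-2}\Bigr| \ge \tfrac12 \varepsilon \lambda_{j+1} B_n(\lambda_{j+1})^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}$$
$$\quad + \sum_{j=k}^{\bar n} P\Bigl\{\sup_{\lambda_{j+1} \le h \le \lambda_j} \Bigl|\sum_{i=j+1}^n e_i \mu_i \lambda_i \bigl(h(\lambda_i + h)^{-2} - \lambda_{j+1}(\lambda_i + \lambda_{j+1})^{-2}\bigr)\Bigr| \ge \tfrac12 \varepsilon \lambda_{j+1} B_n(\lambda_{j+1})^{1/2} V_n(\lambda_j)^{1/2}\Bigr\}.$$

These two terms are both no greater than (2.7). Here note that
$W_i = -e_i \mu_i \lambda_i (\lambda_i + \lambda_{j+1})^{-2}$ and $a = \lambda_{j+1}$ when using Lemma 2.1. This completes the
proof of (2.11). (2.8) is now established. □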


Finally, comparing (2.9) with (2.8), we see that the former can be proved in a 
similar way. Hence (2.3) is established. The proof of Theorem 1 is now complete. 


3. Stein estimates and GCV. Consider the following simplified version of
Stein estimates and the associated unbiased risk estimate,

$$\tilde\mu_n(h) = y_n - \sigma^2 \operatorname{tr} A_n(h)\, \|A_n(h) y_n\|^{-2} A_n(h) y_n$$

and

$$\mathrm{SURE}_n(h) = \sigma^2 - \sigma^4 (\operatorname{tr} A_n(h))^2 / n\|A_n(h) y_n\|^2,$$

where $A_n(h) = I - M_n(h)$. Clearly $\hat h_G$ minimizes $\mathrm{SURE}_n(h)$ over $h \ge 0$. Li
(1985) has shown that $\mathrm{SURE}_n(\hat h_G)$ is a consistent estimate of the true loss
$\tilde L_n(\hat h_G) = n^{-1}\|\mu_n - \tilde\mu_n(\hat h_G)\|^2$ essentially without any assumptions on the
matrix $M_n(h)$ and $\mu_n$. With (A.1) and other conditions to be given, we may
strengthen this result.
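The claim that the GCV choice minimizes $\mathrm{SURE}_n(h)$ rests on the fact that $\mathrm{SURE}_n(h)$ is an increasing function of the GCV score. The sketch below illustrates this in the diagonal model, assuming the usual Craven-Wahba form $\mathrm{GCV}(h) = n^{-1}\|A_n(h)y_n\|^2 / (n^{-1}\operatorname{tr}A_n(h))^2$ for criterion (1.2), which is not restated in this part of the paper; `sure_and_gcv` is a hypothetical helper name.

```python
def sure_and_gcv(lam, y, h, sigma2):
    """Return (SURE_n(h), GCV(h)) for the diagonal ridge model with h > 0.

    lam: eigenvalues lambda_i (zeros allowed), y: observations, sigma2: error variance.
    """
    n = len(y)
    a = [h / (l + h) for l in lam]                          # diagonal of A_n(h) = I - M_n(h)
    rss = sum((a_i * y_i) ** 2 for a_i, y_i in zip(a, y))   # ||A_n(h) y_n||^2
    tr_a = sum(a)                                           # tr A_n(h)
    sure = sigma2 - sigma2 ** 2 * tr_a ** 2 / (n * rss)
    gcv = (rss / n) / (tr_a / n) ** 2                       # assumed form of (1.2)
    return sure, gcv


if __name__ == "__main__":
    lam = [4.0, 2.0, 1.0, 0.5, 0.0, 0.0]
    y = [3.0, -1.0, 0.5, 2.0, -0.7, 1.1]
    for h in (0.1, 1.0, 5.0):
        print(h, sure_and_gcv(lam, y, h, sigma2=1.0))
```

Under this assumption $\mathrm{SURE}_n(h) = \sigma^2 - \sigma^4/\mathrm{GCV}(h)$, so both criteria rank every pair of ridge parameters the same way.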


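PROPOSITION 3.1. Under (A.1), for any $h$, random or not, such that

(3.1) $$(n^{-1}\operatorname{tr} M_n(h))^2 / n^{-1}\operatorname{tr} M_n^2(h) \to 0$$

and

(3.2) $$n^{-1}\|A_n(h) y_n\|^2 \to \sigma^2,$$

we have

(3.3) $$|\mathrm{SURE}_n(h) - L_n(h) - n^{-1}\|e_n\|^2 + \sigma^2| / L_n(h) \to 0$$

and

(3.4) $$n^{-1}\|\hat\mu_n(h) - \tilde\mu_n(h)\|^2 / L_n(h) \to 0.$$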


Proor. Rewrite the left side of (3.3) as 


o*trA,(h) o4(tr A,(h))” 

—— 7 rA h a) SS 

nla,(h)y, [Pr AO ml 


/ L,(h) 
< 20°trA,(h)|(eq, ACA) l/a] Ag h)¥a | LA) 


+20%trA,(h)|(e,, M,(R)e,) — o%tr M,(A)|/n A (A) LA) 
o*trA,(h) 7 


eC L,(h). 
lAl / a 


Now by (2.1)-(2.3) and (3.2), the first two terms of the last expression tend to 0.
To show that the third also converges to 0, it is enough to prove

$$|\sigma^2 n^{-1}\operatorname{tr} A_n(h) - n^{-1}\|A_n(h) y_n\|^2|\,|\sigma^2 - n^{-1}\|e_n\|^2| / L_n(h) \to 0.$$

Since $\|A_n(h)y_n\|^2 = \|e_n\|^2 + 2\langle e_n, A_n(h)\mu_n\rangle - 2\langle e_n, M_n(h)e_n\rangle + nL_n(h)$, the first
absolute value factor in the last expression does not exceed

$$|\sigma^2 - n^{-1}\|e_n\|^2| + L_n(h) + 2n^{-1}|\langle e_n, A_n(h)\mu_n\rangle|
 + 2n^{-1}|\langle e_n, M_n(h)e_n\rangle - \sigma^2 \operatorname{tr} M_n(h)| + n^{-1}\sigma^2 \operatorname{tr} M_n(h).$$

Thus by (2.1) and (2.2) again, it remains to show that

(3.5) $$(\sigma^2 - n^{-1}\|e_n\|^2)^2 / L_n(h) \to 0$$

and

(3.6) $$(n^{-1}\operatorname{tr} M_n(h))\,|\sigma^2 - n^{-1}\|e_n\|^2| / L_n(h) \to 0.$$

By the central limit theorem, (A.1), and (2.3), we get (3.5). Finally by (3.5) and
(2.3), (3.6) holds because $(n^{-1}\operatorname{tr} M_n(h))^2 \le R_n(h)$. Hence we have proved (3.3).
The proof of (3.4) is omitted since it is similar to the proof of (6.7) of Li (1984)
(see also Li and Hwang (1984) for the case where $h$ is nonrandom). □

(3.2) is equivalent to the consistency of $\hat\mu_n(h)$, i.e.,

(3.7) $$L_n(h) \to 0.$$

This condition may imply (3.1) if we assume the following condition on the
asymptotic distribution of the eigenvalues $\lambda_i$:

(A.2) For any $m$ such that $m/n \to 0$, we have
$$\Bigl(\frac1n \sum_{i=m+1}^n \lambda_i\Bigr)^2 \Big/ \frac1n \sum_{i=m+1}^n \lambda_i^2 \to 0.$$


LEMMA 3.1. Under (A.1) and (A.2), (3.7) implies (3.1).


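PROOF. Define $\hat m = i$ if $\lambda_{i+1} < h \le \lambda_i$. Clearly we have

(3.8) $$2^{-l} n^{-1}\Bigl(\hat m + \sum_{i=\hat m+1}^n \lambda_i^l / h^l\Bigr)
 \le n^{-1} \sum_{i=1}^n [\lambda_i/(\lambda_i + h)]^l
 \le n^{-1}\Bigl(\hat m + \sum_{i=\hat m+1}^n \lambda_i^l / h^l\Bigr)$$

for $l = 1, 2$. From this it follows that (3.1) is equivalent to

(3.9) $$\Bigl(n^{-1}\Bigl(\hat m + \sum_{i=\hat m+1}^n \lambda_i / h\Bigr)\Bigr)^2 \Big/ n^{-1}\Bigl(\hat m + \sum_{i=\hat m+1}^n \lambda_i^2 / h^2\Bigr) \to 0.$$

On the other hand due to (2.3), (3.7) implies that $R_n(h) \to 0$, which in turn
implies $n^{-1}\operatorname{tr} M_n^2(h) \to 0$. Hence by (3.8), $\hat m/n \to 0$. Now it can be seen that
(3.9) follows from (A.2). This completes the proof of Lemma 3.1. □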


We are ready to prove the following main result of this section. 


THEOREM 2. Assume that (A.1), (A.2), and the following condition hold:

(A.3) $$\inf_{h \ge 0} L_n(h) \to 0.$$

Then $\hat h_G$ is a.o. Moreover $\tilde L_n(\hat h_G)/L_n(\hat h_G) \to 1$.

PROOF. Let $h^*$ be the minimizer of the left side of (A.3). Then by Lemma 3.1
and Proposition 3.1, we have

(3.10) $$\mathrm{SURE}_n(h^*) - n^{-1}\|e_n\|^2 + \sigma^2 = L_n(h^*)(1 + o_p(1)).$$

On the other hand, by Theorem 5.4 of Li (1985), (3.7) holds for $h = \hat h_G$.
Therefore, we also have

(3.11) $$\mathrm{SURE}_n(\hat h_G) - n^{-1}\|e_n\|^2 + \sigma^2 = L_n(\hat h_G)(1 + o_p(1)).$$

Since $\mathrm{SURE}_n(\hat h_G) \le \mathrm{SURE}_n(h^*)$, Theorem 2 is now proved by comparing (3.10)
with (3.11). □


We may apply Theorem 2 to the problem of spline smoothing. For instance, if the
$x_i$'s are equispaced, then Craven and Wahba (1979) showed that $\lambda_i \approx c\,i^{-2k}$, for
some constant $c$. Now

$$\frac1n \sum_{i=m}^n \lambda_i \approx c\,n^{-2k} \int_{m/n}^1 x^{-2k}\,dx \approx c(2k - 1)^{-1} n^{-1} m^{-2k+1}$$

and

$$\frac1n \sum_{i=m}^n \lambda_i^2 \approx c^2 n^{-4k} \int_{m/n}^1 x^{-4k}\,dx \approx c^2 (4k - 1)^{-1} n^{-1} m^{-4k+1}.$$

Hence (A.2) holds. (A.3) is guaranteed by the existence of a consistent estimate of
$f$. (A.1) will be satisfied unless $f$ is a polynomial of degree $k - 1$ or less. Thus we
have


COROLLARY. For the problem of spline smoothing, if $f$ is not a polynomial of
degree $k - 1$ or less, and the $x_i$'s are equispaced, then $\hat h_G$ is a.o.
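Condition (A.2) can be explored numerically for polynomially decaying eigenvalues. The sketch below assumes the exact form $\lambda_i = c\,i^{-2k}$ (an idealization of the asymptotic statement above) and computes the ratio in (A.2) directly; `a2_ratio` is a hypothetical helper name.

```python
def a2_ratio(n, m, k, c=1.0):
    """Ratio (n^{-1} sum_{i>m} lambda_i)^2 / (n^{-1} sum_{i>m} lambda_i^2)
    for the assumed eigenvalues lambda_i = c * i^(-2k)."""
    lam = [c * i ** (-2 * k) for i in range(m + 1, n + 1)]
    mean_first = sum(lam) / n
    mean_second = sum(l * l for l in lam) / n
    return mean_first ** 2 / mean_second


if __name__ == "__main__":
    # With m/n held small, the ratio should be of the order of m/n.
    for n, m in ((10_000, 100), (40_000, 200), (160_000, 400)):
        print(n, m, a2_ratio(n, m, k=2))
```

The integral approximations in the text suggest that this ratio behaves like $(4k - 1)(2k - 1)^{-2}\, m/n$, which tends to 0 whenever $m/n \to 0$.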


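Finally we give an example to show that violating (A.2) may incur the
inefficiency of $\hat h_G$.


EXAMPLE. Let

$$\lambda_1 = \cdots = \lambda_{n^{1/2}} = n^2, \qquad
 \lambda_{n^{1/2}+1} = \cdots = \lambda_{n/2} = 2, \qquad
 \lambda_{n/2+1} = \cdots = \lambda_n = 1,$$
$$\mu_1 = \cdots = \mu_{n^{1/2}} = n^{7/4},$$

and

$$\mu_{n^{1/2}+1} = \cdots = \mu_n = 0.$$

For any $h$ such that $h \to \infty$ and $h \ll n^{1/2}$, we have
$nR_n(h) \approx \sigma^2(n^{1/2} + 2.5nh^{-2}) + h^2$. Thus (A.1) and (A.3) hold. In fact,
$[\inf_{h \ge 0} nR_n(h)] / n^{1/2}(\sigma^2 + 10^{1/2}\sigma) \to 1$ as $n \to \infty$, and the minimizer
$h^* \approx (2.5n\sigma^2)^{1/4}$. On the other hand,
(1.2) can be written as

(3.12) $$C_1 D \sum_{i=1}^{n^{1/2}} y_i^2 + C_2 D \sum_{i=n^{1/2}+1}^{n/2} e_i^2 + C_3 D \sum_{i=n/2+1}^{n} e_i^2,$$

where

$$C_1 = n(h + n^2)^{-2}, \qquad C_2 = n(h + 2)^{-2}, \qquad C_3 = n(h + 1)^{-2},$$

and

$$D = \bigl(n^{1/2}(h + n^2)^{-1} + (\tfrac12 n - n^{1/2})(h + 2)^{-1} + \tfrac12 n (h + 1)^{-1}\bigr)^{-2}.$$

Now using Taylor's expansion, for $h$ such that $h \to \infty$ and $h \ll n^{1/2}$, we have

$$C_1 D = n^{-5}h^2 + o(n^{-5}h^2),$$
$$C_2 D = n^{-1}\bigl(1 - h^{-1} + \tfrac74 h^{-2} + 2n^{-1/2} + o(n^{-1/2} + h^{-2})\bigr),$$

and

$$C_3 D = n^{-1}\bigl(1 + h^{-1} - \tfrac54 h^{-2} + 2n^{-1/2} + o(h^{-2} + n^{-1/2})\bigr).$$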




Substituting into (3.12), we obtain the leading terms 


nr a n n/2 
nh? y y? + nota bee rel J Fe 
n/2 


i=] 1/2 n/2 nin 
1 n 

Ai: x ei ‘a +2n-¥?) = n'h? + Ea 292 + B L et one"). 
nin 


Thus hg = veer Compared with h*, we see that hg i is not a.o. Note that 
the condition (5.6) of Li (1985) is satisfied and hence Agi is consistent. 
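The risk calculation in the Example can be checked with a few lines of arithmetic. The sketch below takes the leading-term risk $\sigma^2(n^{1/2} + 2.5nh^{-2}) + h^2$ quoted in the Example as given and minimizes it on a grid; the function names and the grid search are illustrative choices, not part of the original argument.

```python
import math


def approx_risk(h, n, sigma2):
    """Leading terms of n R_n(h) quoted in the Example."""
    return sigma2 * (math.sqrt(n) + 2.5 * n / h ** 2) + h ** 2


def minimize_on_grid(f, lo, hi, steps=20000):
    """Crude grid search for the minimizer of f on [lo, hi]."""
    best_h, best_val = lo, f(lo)
    for i in range(1, steps + 1):
        h = lo + (hi - lo) * i / steps
        val = f(h)
        if val < best_val:
            best_h, best_val = h, val
    return best_h, best_val


if __name__ == "__main__":
    n, sigma2 = 10 ** 6, 1.0
    h_num, risk_num = minimize_on_grid(lambda h: approx_risk(h, n, sigma2), 1.0, 200.0)
    h_formula = (2.5 * n * sigma2) ** 0.25
    risk_formula = math.sqrt(n) * (sigma2 + math.sqrt(10.0 * sigma2))
    print(h_num, h_formula, risk_num, risk_formula)
```

Setting the derivative $-5\sigma^2 n h^{-3} + 2h$ to zero gives $h^4 = 2.5n\sigma^2$, and substituting back gives the minimum $n^{1/2}(\sigma^2 + 10^{1/2}\sigma)$, in agreement with the values stated in the Example.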


REFERENCES 


COX, D. D. (1983). Personal communication.

CRAVEN, P. and WAHBA, G. (1979). Smoothing noisy data with spline functions: estimating the
correct degree of smoothing by the method of generalized cross-validation. Numer. Math.
31 377-403.

ERDAL, A. (1983). Cross validation for ridge regression and principal component analysis. Thesis,
Div. of Applied Mathematics, Brown Univ.

GOLUB, G., HEATH, M. and WAHBA, G. (1979). Generalized cross-validation as a method for choosing
a good ridge parameter. Technometrics 21 216-223.

LI, K. C. (1985). From Stein's unbiased risk estimates to the method of generalized cross-validation.
Ann. Statist. 13 1362-1377.

LI, K. C. (1984). Asymptotic optimality for $C_p$, $C_L$, cross-validation and generalized cross-validation:
discrete index set. Unpublished.

LI, K. C. and HWANG, J. (1984). The data smoothing aspect of Stein estimates. Ann. Statist. 12
887-897.

MALLOWS, C. L. (1973). Some comments on $C_p$. Technometrics 15 661-675.

SPECKMAN, P. (1985). Spline smoothing and optimal rates of convergence in nonparametric regression
models. Ann. Statist. 13 970-983.

SPECKMAN, P. (1982). Efficient nonparametric regression with cross-validated smoothing splines.
Unpublished.

STEIN, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9
1135-1151.

WAHBA, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the
generalized spline smoothing problem. Ann. Statist. 13 1378-1402.


DEPARTMENT OF MATHEMATICS
UNIVERSITY OF CALIFORNIA
LOS ANGELES, CALIFORNIA 90024


The Annals of Statistics
1986, Vol. 14, No. 3, 1113-1131


ASYMPTOTICALLY MINIMAX ESTIMATORS FOR
DISTRIBUTIONS WITH INCREASING FAILURE RATE¹


By JANE-LING WANG 
University of California, Davis, and University of Iowa 


We construct nonparametric estimators of the distribution function $F$
and its hazard function in the class of all increasing failure rate (IFR)
distributions. Denoting the empirical distribution and empirical hazard
function by $F_n$ and $H_n$, respectively, let $C_n$ be the greatest convex minorant of
$H_n$, and $G_n$ the distribution with hazard function $C_n$. The estimator $G_n$ is
itself IFR. We prove that under suitable restrictions on $F$, and for any fixed $\lambda$
with $F(\lambda) < 1$, $\sup_{x \le \lambda} n^{1/2}|C_n(x) - H_n(x)|$ and $\sup_{x \le \lambda} n^{1/2}|G_n(x) - F_n(x)|$
both tend to zero in probability. This means that $G_n$ and $F_n$ are asymptotically
$n^{1/2}$ equivalent. It follows from Millar (1979) that $F_n$ is asymptotically
minimax among the class of all IFR distributions for a large class of loss
functions. This property extends to our estimator $G_n$ under some restrictions.


1. Introduction. This paper deals with estimation of a cumulative distribution
function and its hazard function. Given a set of observations $X_1, \ldots, X_n$
from a common distribution function $F$, the most standard nonparametric
estimator of $F$ is the empirical cumulative distribution function $F_n$. This
estimator $F_n$ of $F$ was proved by Dvoretzky, Kiefer, and Wolfowitz (1956) to be
asymptotically minimax among the collection of all (continuous) distribution
functions. Therefore, in the absence of additional information about the shape of
$F$ (except possibly that $F$ is continuous), the empirical distribution function (or a
continuous version of it) is the optimal estimator for the true distribution
function $F$ in the asymptotically minimax sense.

However, this does not solve the problem of optimal estimation if one has
some information about the shape of the true distribution function. Kiefer and
Wolfowitz (1976), motivated by questions arising in reliability theory, reopened
the issue and proved that the empirical distribution function is still asymptotically
minimax among the class of all concave distribution functions. Note that a
distribution function is called concave if it is concave on its interval of support.
However, $F_n$, not being concave itself, may be considered inappropriate for some
purposes. The problem then is to construct an asymptotically minimax estimator
which is concave. It follows immediately from Marshall's lemma that the least
concave majorant (LCM) $C_n$ of $F_n$ satisfies

$$\sup_t |C_n(t) - F(t)| \le \sup_t |F_n(t) - F(t)|.$$


Received March 1983; revised November 1985. 

¹ This research was supported in part by National Science Foundation Grant MCS 78-25301 and
Army Research Office Contract DAAC-29-79-C-0093, and was completed in partial fulfillment of
requirements for the Ph.D. degree at the University of California, Berkeley.

AMS 1980 subject classifications. Primary 62G05, 62N06; secondary 62G20. 

Key words and phrases. Asymptotically minimax estimator, increasing failure rate, greatest 
convex minorant, (empirical) hazard function. 




This indicates that the LCM $C_n$ of $F_n$ is a better estimator of $F$ than $F_n$, when
one considers loss functions of the Kolmogorov-Smirnov type. It is also the
maximum likelihood estimator as mentioned in Grenander (1956). The case of
convex distribution functions can be dealt with similarly.

One interesting family of distribution functions occurring in reliability theory
is the family of distributions with increasing failure rate (IFR). Millar (1979)
proved that $F_n$ is asymptotically minimax among the class of all IFR distribution
functions. However, $F_n$ has the drawback that it does not have IFR. The
maximum likelihood estimate in this family was found by Grenander (1956).
Marshall and Proschan (1965) proved the strong consistency of the MLE and
Prakasa Rao (1970) obtained the asymptotic distribution of the MLE of the
density function. However, since no estimator of the distribution function in the
IFR family has previously been shown to achieve optimality of any kind, it is of
interest to look for optimal estimators which have IFR. The fact that $F_n$ is an
optimal estimator of $F$ and its hazard function $H_n$ (defined in Section 2) is not
convex leads us to consider modifying $H_n$ into a convex function and then to
compare the performance of $H_n$ to its modified version. Denote
$\alpha_0(F) = \inf\{x : F(x) > 0\}$ and $\alpha_1(F) = \sup\{x : F(x) < 1\}$. If one tries to measure the
goodness of fit of $H_n$ to $H$ by $\sup_{t < \alpha_1(F)} |H_n(t) - H(t)|$, it is obvious that
$\lim_{t \to \alpha_1(F)} H(t) = \infty$, unless $F$ has a jump at $\alpha_1$, and hence
$\sup_{t < \alpha_1(F)} |H_n(t) - H(t)| = \infty$. Therefore,
goodness of fit of an estimator of $H$ should only be measured on intervals
bounded away from $\alpha_1(F)$. In this paper for any given number $\lambda < \alpha_1(F)$,
we shall consider the problem of estimating $H$ only on $[0, \lambda]$. Motivated by
the work of Kiefer and Wolfowitz (1976), we shall modify $H_n$ by its
greatest convex minorant (GCM) $C_n$ on $[0, \lambda]$, and under certain restrictions
$\sup_{t \le \lambda} n^{1/2}|H_n(t) - C_n(t)|$ will tend to zero in probability.

Let $G_n(t) = 1 - e^{-C_n(t)}$ be the distribution with hazard function $C_n$. This type
of estimator ($G_n$ and $C_n$) will be used throughout Sections 3 and 4. Several other
estimators will also be discussed in Sections 3 and 5.

Section 2 gives some notation and the formal construction of the estimators $C_n$
and $G_n$. Using a series of lemmas, in Theorem 1 of Section 3 we show that under
suitable assumptions, $\sup_{t \le \lambda} n^{1/2}|C_n(t) - H_n(t)| \to 0$ in probability. The proof
follows the same pattern as the proof of Theorem 1 in Kiefer and Wolfowitz
(1976). Later in Theorem 2 of Section 4, the assumptions of Theorem 1 are
relaxed to only that of uniform convexity of the true hazard function $H$. The
asymptotic $n^{1/2}$ equivalence of $G_n$ and $F_n$ then follows immediately from the fact
that

$$\sup_{t \le \lambda} |G_n(t) - F_n(t)| = \sup_{t \le \lambda} |\exp[-C_n(t)] - \exp[-H_n(t)]|
 \le \sup_{t \le \lambda} |C_n(t) - H_n(t)|.$$
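The last inequality is an instance of the elementary bound $|e^{-x} - e^{-y}| \le |x - y|$ for $x, y \ge 0$. A short derivation, added here as a sketch and using only that $0 \le C_n(t) \le H_n(t)$ on $[0, \lambda]$, is:

```latex
\begin{align*}
|G_n(t) - F_n(t)|
  &= \bigl| e^{-H_n(t)} - e^{-C_n(t)} \bigr|
   = \Bigl| \int_{C_n(t)}^{H_n(t)} e^{-s}\, ds \Bigr| \\
  &\le |H_n(t) - C_n(t)|
   \qquad \text{since } 0 < e^{-s} \le 1 \text{ for } s \ge 0 .
\end{align*}
```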


Notice that in the construction of our estimate $C_n$, one does not need to know
$\alpha_0(F)$. (See also Remark 3 of Section 4 for a more detailed discussion.) However,
$\alpha_1(F)$ or at least a lower bound for it will have to be known in order to make an
arbitrary choice of $\lambda < \alpha_1(F)$. In the absence of such information, it would be
much nicer if we just take the GCM of $H_n(x)$ over the entire real line. Let $C_n^*(x)$
be the GCM of $H_n$ on the real line and $G_n^*(x)$ the corresponding distribution
function. It will be shown in Section 5 that one can use $C_n^*(x)$ instead of $C_n(x)$
(depending on $\lambda$). This makes the estimator practically usable. Section 6 shows
the asymptotic minimaxity of the estimators. Section 7 summarizes the result of
this paper. Appendixes 1 to 4 provide the proofs of some of the lemmas and
theorems.

The main effort of this paper is devoted to the proof of Theorem 1 of Section
3, although it seems to be more appropriate to call Theorem 3 of Section 5 the
main theorem of the paper.


2. Construction of the estimators $C_n$, $G_n$ and some notation. We shall
start this section with some definitions.


DEFINITION 1. Let $F$ be a distribution function. The hazard function $H_F(t)$
of $F(t)$ is defined to be $H_F(t) = -\log[1 - F(t)]$. If, in addition, $F$ has a density
$f$, the failure rate function $\gamma_F(t)$ of $F(t)$ is defined to be $\gamma_F(t) = f(t)/[1 - F(t)]$
for $F(t) < 1$. Note that $\gamma_F(t)$ is the derivative of the hazard function $H_F(t)$.


DEFINITION 2. A distribution function $F$ is said to have increasing failure
rate if the support of $F$ is an interval denoted by $[\alpha_0(F), \alpha_1(F)]$, and the hazard
function $H_F(t)$ is convex on the support of $F$.


Marshall and Proschan (1965) proved that an IFR distribution $F$ is absolutely
continuous except for the possibility of a jump at $\alpha_1(F)$. Hence the failure rate
$\gamma_F(t)$ exists (except possibly at $\alpha_1(F)$) and is a nondecreasing function of $t$.

For the rest of this paper, $F$ will always be a distribution function with IFR.
The shape of $F$ is unknown except that it is known to have IFR. $F_n$ will be its
empirical distribution function from a sample of independent and identically
distributed observations. The notation $H(t)$ will be used to denote $H_F(t)$, and
$H_n(t)$ will be used to denote the empirical hazard function $H_{F_n}(t)$, that is,
$H_n(t) = -\log[1 - F_n(t)]$. Also $\alpha_0, \alpha_1$ will be used instead of $\alpha_0(F)$ and $\alpha_1(F)$.
Since $\alpha_0 \ge 0$ for most practical applications, we shall assume $\alpha_0 \ge 0$, although all
we need is that $\alpha_0 > -\infty$, which will be implied by the uniform strict convexity
assumption (A) in Theorems 1 and 2 on $H$.

Since we shall estimate $F$ through its hazard function $H$, for reasons mentioned
in Section 1 we shall consider the problem of estimating $F$ and $H$ only on
$[0, \lambda]$, for any $\lambda < \alpha_1$. Let $C_n$ be the GCM of $H_n$ on $[0, \lambda]$. This GCM $C_n$ is the
supremum of the convex functions that are smaller than or equal to $H_n$ on $[0, \lambda]$.
(For references on this subject, see the book by Barlow, Bartholomew, Bremner,
and Brunk (1972).) Leurgans (1982) and Groeneboom (1983) provide asymptotic
distributions of the slope of the GCM of some processes. Our estimate $G_n$ of $F$ on
the restricted range is then the distribution function that has $C_n$ for its hazard
function. Explicitly,

$$1 - G_n(t) = \exp\{-C_n(t)\} \quad \text{for } t \text{ in } [0, \lambda].$$

This type of estimator $C_n$ and $G_n$ will be used throughout Sections 3 and 4.
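The construction above can be sketched in code. Because $H_n$ is a nondecreasing step function, its GCM on $[0, \lambda]$ is the lower convex hull of the points $(0, 0)$, the left limits $(X_{(i)}, H_n(X_{(i)}-))$ at the jump points below $\lambda$, and $(\lambda, H_n(\lambda))$. The sketch assumes distinct positive observations, that $\lambda$ is not an observation, and that $F_n(\lambda) < 1$; all function names are illustrative and not taken from the paper.

```python
import math


def lower_convex_hull(points):
    """Lower convex hull (monotone chain) of points sorted by increasing x."""
    hull = []
    for px, py in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop the middle point if it lies on or above the chord to the new point.
            if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull


def gcm_vertices(sample, lam):
    """Vertices of the GCM C_n of the empirical hazard H_n on [0, lam]."""
    xs = sorted(sample)
    n = len(xs)
    below = [x for x in xs if x < lam]
    if len(below) == n:
        raise ValueError("need F_n(lam) < 1")
    points = [(0.0, 0.0)]
    for i, x in enumerate(below):
        # Just to the left of x, exactly i observations have occurred.
        points.append((x, -math.log(1.0 - i / n)))
    points.append((lam, -math.log(1.0 - len(below) / n)))
    return lower_convex_hull(points)


def evaluate(hull, t):
    """Evaluate the piecewise linear function with the given vertices at t."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= t <= x2:
            return y1 + (y2 - y1) * (t - x1) / (x2 - x1)
    raise ValueError("t outside [0, lam]")


def empirical_hazard(sample, t):
    """H_n(t) = -log(1 - F_n(t)); requires F_n(t) < 1."""
    return -math.log(1.0 - sum(x <= t for x in sample) / len(sample))


def g_n(hull, t):
    """Estimator G_n(t) = 1 - exp(-C_n(t))."""
    return 1.0 - math.exp(-evaluate(hull, t))


if __name__ == "__main__":
    data = [0.3, 0.9, 1.4, 1.6, 2.5, 3.1, 4.0, 5.2]
    hull = gcm_vertices(data, lam=3.5)
    print(hull)
    print(g_n(hull, 2.0), empirical_hazard(data, 2.0))
```

Since the hull is convex and lies below $H_n$, the resulting $G_n$ has a convex hazard function on $[0, \lambda]$ and never falls above $F_n$ there.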




Let us now define some notation and a linear interpolating function for any
function on $[\alpha_0, \lambda]$.

Let $\{k_n, n \ge 1\}$ be a sequence of positive integers satisfying $k_n \to \infty$.

Partition the interval $[\alpha_0, \lambda]$ into $k_n$ equal length subintervals $[a_j^n, a_{j+1}^n]$,
$j = 0, \ldots, k_n - 1$, where

$$a_j^n = \frac{jL}{k_n} + \alpha_0, \qquad j = 0, \ldots, k_n,$$

and

$$L = \lambda - \alpha_0.$$

For any function $g$ on $[\alpha_0, \lambda]$, we define its linear interpolating function $L_n g$
as

$$L_n g(a_j^n) = g(a_j^n) \quad \text{for } j = 0, \ldots, k_n,$$

and linear on $[a_j^n, a_{j+1}^n]$ for $j = 0, \ldots, k_n - 1$.

In particular, $L_n H_n$ is the piecewise linear interpolating process of $H_n$, and
$L_n H$ is the linear interpolating function of $H$.
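The interpolation operator $L_n$ is straightforward to implement. The following sketch builds the knots $a_j^n$ and the interpolant of an arbitrary function; `knots` and `interpolate` are hypothetical helper names, and the test function $g(x) = x^2$ is only an illustration.

```python
def knots(alpha0, lam, k):
    """Equally spaced knots a_j = alpha0 + j * L / k, j = 0, ..., k, with L = lam - alpha0."""
    length = lam - alpha0
    return [alpha0 + j * length / k for j in range(k + 1)]


def interpolate(g, alpha0, lam, k):
    """Return the piecewise linear interpolant of g at the k + 1 knots on [alpha0, lam]."""
    a = knots(alpha0, lam, k)
    vals = [g(x) for x in a]

    def interpolant(t):
        if not alpha0 <= t <= lam:
            raise ValueError("t outside [alpha0, lam]")
        # Index of the subinterval containing t (the last one if t == lam).
        j = min(int((t - alpha0) * k / (lam - alpha0)), k - 1)
        w = (t - a[j]) / (a[j + 1] - a[j])
        return (1.0 - w) * vals[j] + w * vals[j + 1]

    return interpolant


if __name__ == "__main__":
    square = lambda x: x * x
    approx = interpolate(square, 0.0, 2.0, 8)
    print(approx(0.3), square(0.3))
```

For a convex function the interpolant lies on or above the function, and for a function whose derivative is Lipschitz with constant $d$ the interpolation error shrinks at the rate $k_n^{-2}$, which is the content of Proposition 3 in Section 3.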

Let $\|g\|_\lambda = \sup\{|g(t)| : \alpha_0 \le t \le \lambda\}$.

For simplicity, let us introduce the following notation: Let $\varepsilon = 1 - F(\lambda)$.
Then $\varepsilon > 0$ and is fixed. Let $S(t) = 1 - F(t)$ be the survival function. Then
$S(t) \ge \varepsilon$ for all $t$ in $[\alpha_0, \lambda]$. Let $S_n(t) = 1 - F_n(t)$ be the empirical survival
function. Since $\lambda < \alpha_1$, the failure rate $\gamma_F(t)$ of $F$ has an upper bound on $[\alpha_0, \lambda]$.
Let $M = \|\gamma_F\|_\lambda$ denote this upper bound.


3. Asymptotic $n^{1/2}$ equivalence of $C_n$ and $H_n$. We shall prove the
asymptotic $n^{1/2}$ equivalence of $C_n$ and $H_n$ in this section. Recall that as defined
in Section 2, $C_n$ is the GCM of $H_n$ on $[0, \lambda]$. An important assumption in many of
our results is that there exists a $c > 0$ such that

(A) $$H'(v) - H'(u) \ge c(v - u)$$

for any $u < v$, both in $[\alpha_0, \lambda]$, for which the derivatives exist. This assumption
can be described as requiring that $H$ be uniformly convex. Later it will also be
necessary to assume there is a $d < \infty$ such that

(B) $$H'(v) - H'(u) \le d(v - u)$$

for any $u < v$, both in $[\alpha_0, \lambda]$, for which the derivatives exist.

The proof follows the same pattern as that of Theorem 1 of Kiefer and
Wolfowitz (1976), and will be accomplished in the following five steps.


STEP 1.


LEMMA 1. For any convex function $C(x)$ on $[\alpha_0, \lambda]$,
$$\|C_n - C\|_\lambda \le \|H_n - C\|_\lambda.$$


PROOF. The proof of this lemma is similar to that of Marshall's Lemma B
(1970), except that Marshall assumes continuity, which is unnecessary. □


STEP 2. Under some restrictions on $F$, for suitably chosen $k_n = o(n^{1/3})$,
$L_n H_n$ (defined in Section 2) is convex with probability tending to 1. More
explicitly, let the event $A_n = \{L_n H_n$ is convex on $[\alpha_0, \lambda]\}$.


PROPOSITION 1. If F has IFR and satisfies assumption (A), then for suffi- 
ciently large n, 
nL’ Lec]? | 


1 — Pr(A,„) < zenen- GMK 


PROOF. The proof is given in Appendix 1.


REMARK. The same rate $n^{1/3}$ also appears in Lemma 4.1 of Prakasa Rao
(1970), and there is some connection between the two results.


STEP 3. Under $A_n$, using Lemma 1 with $C(x) = L_n H_n(x)$, we have

$$\|C_n - H_n\|_\lambda \le \|C_n - L_n H_n\|_\lambda + \|L_n H_n - H_n\|_\lambda \le 2\|L_n H_n - H_n\|_\lambda.$$


STEP 4.

PROPOSITION 2. If $n^{1/2}\|H - L_n H\|_\lambda \to 0$ in probability, then
$n^{1/2}\|H_n - L_n H_n\|_\lambda \to 0$ in probability.

PROOF. The proof is given in Appendix 2.


STEP 5. Under some restrictions on $F$, for suitably chosen $k_n$ such that
$n^{1/4}/k_n = o(1)$, we have $n^{1/2}\|H - L_n H\|_\lambda \to 0$ in probability. This follows from
Proposition 3.

PROPOSITION 3. If $F$ has IFR and satisfies assumption (B), then
$\|H - L_n H\|_\lambda \le 2dL^2 k_n^{-2}$.


PROOF. The proof is similar to that of Lemma 6 in Kiefer and Wolfowitz
(1976), pages 79-80. □

THEOREM 1. If $F$ has IFR and satisfies assumptions (A) and (B), then
$n^{1/2}\|H_n - C_n\|_\lambda \to 0$ in probability.


Proor., Let 
T L'ee?n [3 
3 | 300M logni ` 


By Proposition 1, for n large enough, 1 — P(A„) < n~*. Since n/“4/k,, = o(1), 
the results in Steps 3, 4, and 5 now imply the theorem. O 




REMARK 1. As mentioned in the beginning of this section, the proof of
Theorem 1 follows the same pattern as that of Theorem 1 of Kiefer and
Wolfowitz (1976). The essential step is Proposition 1, which is similar to Lemma 4
of their paper. The definition of the linear interpolating function in Kiefer and
Wolfowitz (1976) is a little bit different, but the idea is the same. The conditions
in our Theorem 1 are weaker than theirs, but their asymptotic $n^{1/2}$ equivalence
holds almost surely while ours holds in probability (which is enough for our
purposes). A similar weakening of conditions is possible for their results to hold in
probability.


REMARK 2. The assumption of existence of the constant d in Theorem 1 can 
be deleted by using a certain transformation. This will be done in Theorem 2 of 
the next section. The existence of the constant c will be guaranteed if 
infa, st<\ (t) > 0, which essentially means that H is uniformly convex on 
[æ A]. This is also assumed in Kiefer and Wolfowitz (1976) as the condition 
B(F) > 0 of (3.2). 


REMARK 3. Under the assumption of Theorem 1, Steps 4 and 5 imply that
for $k_n$ properly chosen (e.g., as in the proof of Theorem 1), $n^{1/2}\|H_n - L_n H_n\|_\lambda \to 0$
in probability. Step 2 then implies that the probability that $L_n H_n$ is convex on
$[\alpha_0, \lambda]$ approaches 1. Hence, instead of using $C_n$ (which involves $n$ points), we
can use $L_n H_n$ (which only involves $k_n$ points) as the estimator. The advantage of
using $L_n H_n$ is that it is much easier to compute than $C_n$, and by Proposition 1,
with probability tending to 1, $L_n H_n$ will be convex. If not, we can then take the
GCM of $L_n H_n$, which is still asymptotically $n^{1/2}$ equivalent to $H_n$, and easier to
compute than $C_n$ itself.


It should be noted that although the construction of $L_n H_n$ depends on $\alpha_0$
(while $C_n$ does not), in the case when $\alpha_0$ is unknown, one can modify it by the
linear interpolating process of $H_n$ on $[X_{(1)}, \lambda]$, where $X_{(1)}$ is the smallest among
the $X_i$'s. This will be clear from the proof of Theorem 1.


4. Extension of Theorem 1 by a convex transformation. We shall show
in this section that the assumption of existence of the constant $d$ in Theorem 1
can be deleted by use of the convex transformation $\phi^{-1}$, where $\phi$ is defined as in
Lemma 2.


LEMMA 2. If $F$ has IFR and satisfies assumption (A), then there is a convex
increasing function $K$ defined on some interval $[\alpha_0, \lambda']$ with the following
properties:

(1) $K$ is twice differentiable with bounded second derivative and
$\inf\{K''(x) : \alpha_0 \le x \le \lambda'\} > 0$.
(2) $\phi(x) = H^{-1}(K(x))$ is a concave function from $[\alpha_0, \lambda']$ to $[\alpha_0, \lambda]$.


PROOF. The proof will be given in Appendix 3.


MINIMAX ESTIMATORS FOR IFR DISTRIBUTIONS 1119 


THEOREM 2. If F has IFR and satisfies assumption (A), then 
nH —C,||, > 0 in probability. 


PROOF. By Lemma 2, there is a convex function K on $[\alpha_0, \bar{\lambda}]$ with properties
(1) and (2) in Lemma 2. Let $\phi(x) = H^{-1}(K(x))$ as in Lemma 2. If X is a random
variable with distribution function F and hazard function H, and if $Y = \phi^{-1}(X)$,
then $P(Y > y) = e^{-H(\phi(y))} = e^{-K(y)}$.

Hence Y has K as its hazard function and K satisfies all the assumptions in
Theorem 1.

Let $X_1, \ldots, X_n$ be independent identically distributed samples from F, and let
$H_n$ be their empirical hazard function. Let $Y_i = \phi^{-1}(X_i)$, $1 \le i \le n$, and $K_n(y)$
be the corresponding empirical hazard function from the $Y_i$. Then
$H_n(x) = K_n(\phi^{-1}(x))$.

Let $D_n$ be the GCM of $K_n$ on $[0, \bar{\lambda}]$. Applying Theorem 1, we have

(4.1) $n^{1/2}\|D_n - K_n\|_{\bar{\lambda}} \to 0$ in probability.

Let $C^*(x) = D_n(\phi^{-1}(x))$ on $[\alpha_0, \lambda]$. From Lemma 2, $C^*$ is a convex function.
Using the fact that $H_n(x) = K_n(\phi^{-1}(x))$ and $D_n$ is a minorant of $K_n$, we have
$C^*$ is a minorant of $H_n$. Moreover, $C^*(x) - H_n(x) = D_n(\phi^{-1}(x)) - K_n(\phi^{-1}(x))$
for all x. Hence $\|C^* - H_n\|_\lambda \le \|D_n - K_n\|_{\bar{\lambda}}$. Since $C_n$ is the GCM of $H_n$ and $C^*$
is a convex minorant of $H_n$, we have

$\|C_n - H_n\|_\lambda \le \|C^* - H_n\|_\lambda \le \|D_n - K_n\|_{\bar{\lambda}}$.

From (4.1), the theorem follows. □
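The identity $H_n(x) = K_n(\phi^{-1}(x))$ used in the proof holds for any increasing transformation of the sample, because the empirical hazard depends on the data only through their ranks. The following minimal Python sketch checks this numerically. The map `phi_inverse` (a square root) and the exponential sample are illustrative assumptions, not the transformation constructed in Lemma 2.

```python
import math
import random


def empirical_hazard(sample, x):
    """H_n(x) = -log(1 - F_n(x)); infinite once every observation is <= x."""
    n = len(sample)
    survivors = sum(1 for v in sample if v > x)
    return math.inf if survivors == 0 else -math.log(survivors / n)


def phi_inverse(x):
    # Hypothetical increasing map standing in for the phi^{-1} of Lemma 2.
    return math.sqrt(x)


random.seed(1)
xs = [random.expovariate(1.0) for _ in range(200)]  # X_1, ..., X_n
ys = [phi_inverse(v) for v in xs]                   # Y_i = phi^{-1}(X_i)

# Compare H_n(x) with K_n(phi^{-1}(x)) on a grid of x values.
grid = [0.05 * j for j in range(1, 60)]
pairs = [(empirical_hazard(xs, x), empirical_hazard(ys, phi_inverse(x))) for x in grid]
print(all(h == k for h, k in pairs))
```

Only the ordering of the observations enters the comparison, which is why the choice of increasing map does not matter for this identity.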


REMARK 1. Theorem 2 implies that the only condition needed to guarantee
the asymptotic $n^{1/2}$ equivalence of $C_n$ and $H_n$ is the uniform convexity of H
(assumption (A)). This is much weaker than the conditions required in Kiefer and
Wolfowitz (1976). A similar weakening of conditions is also possible for their
other results by using a transformation similar to that in Theorem 2.


REMARK 2. So far $\lambda$ has been kept fixed so that the survival function is
strictly greater than zero at $\lambda$. From the proofs of Propositions 1-3, one can see
that it is sometimes not necessary to restrict $\lambda$ to be fixed. For example, if F has
a jump at $\alpha_1$ and satisfies the assumption of Theorem 2 on $[0, \alpha_1]$, the estimator
$C_n(x)$ can be taken to be the GCM of $H_n(x)$ on $[0, \alpha_1]$, and we have
$\sup_{0 \le x < \alpha_1} n^{1/2}|H_n(x) - C_n(x)| \to 0$ in probability. Also under the assumption of
Theorem 2, the estimator $C_n$ can be taken to be the GCM of $H_n$ on $[0, \lambda_n]$, where
$\lambda_n < \alpha_1$ and $\lambda_n$ tends to $\alpha_1$ at a suitably slow rate (depending on F). Then we
have $n^{1/2}\|H_n - C_n\|_{\lambda_n} \to 0$ in probability.


REMARK 3. We have assumed $\alpha_0 \ge 0$ so far. If $\alpha_0 < 0$ and F satisfies
assumption (A), it is clear that $\alpha_0 > -\infty$. Under such circumstances, $C_n$ can be
constructed as the GCM of $H_n$ on $[-\infty, \lambda]$ and the result of Theorem 2 still
holds. Therefore, in constructing $C_n$, one needs to know $\alpha_1$ but not $\alpha_0$. In the next
section, we shall show that even $\alpha_1$ need not be known.


1120 J.-L. WANG 


5. A practical way of constructing the estimators. The construction of
$C_n$ in the previous three sections depends on the knowledge of $\alpha_1$ so as to make
an arbitrary choice of $\lambda$ which is less than $\alpha_1$. In most real life examples, one does
not know $\alpha_1$. A practical way of constructing the estimator is to take the GCM of
$H_n(x)$ over the entire real line. Let us call this GCM $C_n^*(x)$. Let $C_n^\lambda(x)$ denote the
GCM of $H_n(x)$ on $[0, \lambda]$, $\lambda < \alpha_1$; that is, $C_n^\lambda$ is the estimator constructed in
Section 2 and used in Sections 3 and 4. The conclusion of Proposition 4 is that for
any fixed $a < \lambda$,

$\Pr\{C_n^\lambda(x) = C_n^*(x) \text{ for all } x \le a\} \to 1.$

That is, with high probability, the GCM of $H_n(x)$ on the entire support will
coincide with the GCM of $H_n(x)$ on $(0, \lambda]$ for x in $(0, a]$.

This suggests that instead of taking our estimator $C_n^\lambda(x)$ as in Sections 2-4, we
can simply take the GCM $C_n^*(x)$ of $H_n(x)$ on the entire support and $C_n^*(x)$ will
behave just as well as $C_n^\lambda(x)$. In order to formally state Proposition 4, let us
introduce the following definition.
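The two constructions compared in this section can be sketched in Python as follows. This is a minimal illustration, not the author's procedure: it assumes a positive sample without ties, takes $H_n = -\log(1 - F_n)$, and computes each GCM as the lower convex hull of the points $(X_{(i)}, H_n(X_{(i)}-))$. The names `hazard_knots`, `lower_convex_hull` and `evaluate`, the Weibull sample, and the values of `lam` and `a` are all hypothetical choices made for the example.

```python
import bisect
import math
import random


def lower_convex_hull(points):
    """Vertices of the lower convex hull of (x, y) points, left to right."""
    hull = []
    for px, py in sorted(points):
        # Drop the last vertex while it lies on or above the chord to the new point.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull


def hazard_knots(sample, upper=None):
    """Points (x, H_n(x-)) for the empirical hazard H_n = -log(1 - F_n).

    With upper=None the points cover the whole sample; otherwise they stop
    at `upper` (the role of lambda) and end with (upper, H_n(upper)).
    Assumes positive observations, no ties, and some observation above `upper`.
    """
    xs = sorted(sample)
    n = len(xs)
    knots = {0.0: 0.0}
    for i, x in enumerate(xs):
        if upper is not None and x >= upper:
            break
        knots[x] = -math.log(1.0 - i / n)  # i observations lie strictly below x
    if upper is not None:
        k = bisect.bisect_right(xs, upper)  # number of observations <= upper
        knots[upper] = -math.log(1.0 - k / n)
    return sorted(knots.items())


def evaluate(hull, x):
    """Piecewise-linear interpolation through the hull vertices at x."""
    xs = [vx for vx, _ in hull]
    j = min(max(bisect.bisect_right(xs, x) - 1, 0), len(hull) - 2)
    (x0, y0), (x1, y1) = hull[j], hull[j + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


random.seed(0)
sample = [random.weibullvariate(1.0, 2.0) for _ in range(500)]  # H(x) = x**2
lam, a = 1.2, 0.8                                               # hypothetical lambda and a
c_star = lower_convex_hull(hazard_knots(sample))                # GCM over the sample range
c_lam = lower_convex_hull(hazard_knots(sample, upper=lam))      # GCM on [0, lam]
grid = [a * j / 50 for j in range(51)]
max_gap = max(abs(evaluate(c_lam, x) - evaluate(c_star, x)) for x in grid)
print("largest gap between the two GCMs on [0, a]:", max_gap)
```

Since the GCM over the larger domain is a convex minorant on $[0, \lambda]$, it can never exceed the GCM on $[0, \lambda]$; Proposition 4 says that on $[0, a]$ the two agree with probability tending to one.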


DEFINITION 3. A function $\phi$ on [a, b] is called strictly convex if for any
$0 < \varepsilon < 1$ and any distinct x, y in [a, b], the following is true:

$\phi((1 - \varepsilon)x + \varepsilon y) < (1 - \varepsilon)\phi(x) + \varepsilon\phi(y).$

NOTE. (1) If F has IFR and satisfies the assumption of Theorem 2, then H is
strictly convex.

(2) For a strictly convex function $\phi$ on [a, b] and any three points x < y < z
in [a, b], we have

$\dfrac{\phi(y) - \phi(x)}{y - x} < \dfrac{\phi(z) - \phi(x)}{z - x} < \dfrac{\phi(z) - \phi(y)}{z - y}.$

PROPOSITION 4. If F has IFR and H is strictly convex, then for any $\lambda < \alpha_1$
and any $a < \lambda$,

$\Pr\{C_n^\lambda(x) = C_n^*(x) \text{ for all } x \le a\} \to 1.$

PROOF. The proof is given in Appendix 4.


Theorem 2 and Proposition 4 now imply the main theorem of the paper. 


THEOREM 3. If F has IFR and there exists a positive constant c > 0 such
that $H'(v) - H'(u) \ge c(v - u)$ for all u < v in $[\alpha_0, \alpha_1]$ for which the derivative
exists, then $n^{1/2}\|H_n - C_n^*\|_\lambda \to 0$ in probability for all $\lambda < \alpha_1$.


REMARK. Note that the assumption is slightly different from assumption 
(A). 


6. Asymptotic minimaxity of the estimator. Let $\mathcal{F}_\lambda$ = {F: F has IFR
with $\alpha_1(F) > \lambda$}. The asymptotic minimaxity of $H_n$ as an estimator of the true
hazard function H restricted on $[-\infty, \lambda]$ among the class $\mathcal{F}_\lambda$ follows along the
same line as in Millar (1979). Lemma 1 then implies that this optimality property
also extends to the estimator $C_n^\lambda$ for loss functions which satisfy the assumptions
of Millar (1979) and are of the type $l(n^{1/2}\|C_n^\lambda - H\|_\lambda)$ where $l$ is a bounded
continuous nondecreasing function. Thus we have


THEOREM 4. $C_n^\lambda$ is the asymptotically minimax estimator of the hazard
function H among the class $\mathcal{F}_\lambda$.


As for the estimator $C_n^*$, Theorem 4 does not apply to $C_n^*$ because
Lemma 1 fails for $C_n^*$. Although Theorem 3 implies


(6.1) $n^{1/2}\|C_n^* - H_n\|_\lambda \to 0$ in probability,


the convergence in (6.1) is not uniform over distributions satisfying the
assumption of Theorem 3, and there does not seem to be a natural way to restrict the
family of distributions in order to obtain uniform convergence of (6.1).

Let $G_n^\lambda$, $G_n^*$ be the distribution functions with hazard functions $C_n^\lambda$ and $C_n^*$,
respectively. It follows from Theorems 1, 2, and 3 that:


COROLLARY 1. (a) $n^{1/2}\|G_n^\lambda - F_n\|_\lambda \to 0$ in probability under the assumptions
of Theorem 2.
(b) $n^{1/2}\|G_n^* - F_n\|_\lambda \to 0$ in probability under the assumption of Theorem 3.


To obtain the asymptotic minimaxity of $G_n^*$ or $G_n^\lambda$ as an estimator of the true
distribution F we encounter the same difficulty for $G_n^*$ as for $C_n^*$. Therefore we
shall only focus on the behavior of $G_n^\lambda$. Since uniform convergence is hard to
grasp by Theorem 2, we shall restrict ourselves only to distributions which satisfy
the assumption of Theorem 1. Checking the proof of Theorem 1, we find that in
order to get the uniform convergence of $n^{1/2}\|G_n^\lambda - F_n\|_\lambda \to 0$, it suffices to show
that both Proposition 1 and $n^{1/2}\|H - L_nH\|_\lambda \to 0$ hold uniformly in F, which is
equivalent to the requirement that both $M[\varepsilon_0 c]^{-2}$ and $dL^2$ are uniformly
bounded from above by some constants. Consider now the restricted family $\mathcal{S}_\lambda$ of
IFR distributions which satisfy the assumption of Theorem 1 with $M[\varepsilon_0 c]^{-2}$
and $dL^2$ both uniformly bounded from above by some constants. We then have
$\sup_{F \in \mathcal{S}_\lambda} n^{1/2}\|H_n - C_n^\lambda\|_\lambda \to 0$ in probability, which implies


(6.2) $\sup_{F \in \mathcal{S}_\lambda} n^{1/2}\|F_n - G_n^\lambda\|_\lambda \to 0$ in probability.


Using again the technique of Proposition 6.2 of Millar (1979) it can be checked
without difficulty that $F_n$ is still asymptotically minimax among the restricted
class $\mathcal{S}_\lambda$. The asymptotic minimaxity of $G_n^\lambda$ now follows from (6.2) for a loss
function which satisfies the assumptions of Millar (1979) and is of the type
$l(n^{1/2}\|G_n^\lambda - F\|_\lambda)$, where $l$ is a bounded continuous nondecreasing function.


THEOREM 5. $G_n^\lambda$ is the asymptotically minimax estimator of the true
distribution function F among the class $\mathcal{S}_\lambda$.




7. Summary. We have discussed several possible estimators for the hazard
function H. A summary of these estimators is given below. All the following
results require that H be uniformly convex.


1. If $\alpha_1$ is known, then for any $\lambda < \alpha_1$ the estimator $C_n^\lambda$ can be taken to be the
GCM of $H_n$ on $[0, \lambda]$ and $C_n^\lambda$ is asymptotically $n^{1/2}$ equivalent to $H_n$ on $[0, \lambda]$
by Theorem 2.

2. If $\alpha_1$ is unknown, the estimator $C_n^*$ can be taken to be the GCM of $H_n$ on the
entire support of F, and $C_n^*$ is asymptotically $n^{1/2}$ equivalent to $H_n$ on $[0, \lambda]$,
for any $\lambda < \alpha_1$ by Theorem 3. In particular, if F has a jump at $\alpha_1$ then $C_n^*$ is
asymptotically $n^{1/2}$ equivalent to $H_n$ on $[0, \alpha_1]$ and the corresponding estimator
$G_n^*$ for the distribution function is asymptotically $n^{1/2}$ equivalent to $F_n$ on
the whole real line.

3. Under the additional assumption of (B), if we want to save time in computing
the estimator for H, then for $k_n$ properly chosen (for example, as in the proof
of Theorem 1), one can take $L_nH_n$ to be the estimator if it is convex and
otherwise, take the GCM of $L_nH_n$ to be the estimator. In either case, Theorem
1 implies that the adopted estimator will be asymptotically $n^{1/2}$ equivalent to
$H_n$ on $[0, \lambda]$ for any $\lambda < \alpha_1$.

4. It is clear that

$\sup_{\lambda \le x \le \alpha_1} |C_n^*(x) - H_n(x)| = \infty$, if F is continuous.

Thus, $C_n^*$ will not be close to $H_n$ near the right-hand tail. However, this does
not prevent $G_n^*$ from being close to $F_n$ near the right-hand tail. The question
is, under what conditions will $\sup_{-\infty < x < \infty} n^{1/2}|G_n^*(x) - F_n(x)| \to 0$ in
probability. This remains an open problem.


A final remark is that the technique in this paper can also be applied to
distributions with decreasing failure rate by taking the estimator $C_n$ to be the
least concave majorant of $H_n$.
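Item 3 of the summary can be sketched in Python as follows: interpolate $H_n$ linearly at $k_n + 1$ equally spaced points of $[0, \lambda]$, keep the interpolant if it is convex, and otherwise replace it by its GCM, which for a piecewise-linear function is the lower convex hull of its knots. This is an illustration under assumptions: the function names, the Weibull sample, and the values `lam=1.0` and `k=8` are hypothetical, and $k_n$ is not chosen as in the proof of Theorem 1.

```python
import bisect
import math
import random


def empirical_hazard_at(sorted_sample, x):
    """H_n(x) = -log(S_n(x)); assumes at least one observation exceeds x."""
    n = len(sorted_sample)
    survivors = n - bisect.bisect_right(sorted_sample, x)
    return -math.log(survivors / n)


def interpolation_knots(sorted_sample, lam, k):
    """Knots (a_j, H_n(a_j)) of L_n H_n for k equal-length intervals on [0, lam]."""
    return [(lam * j / k, empirical_hazard_at(sorted_sample, lam * j / k))
            for j in range(k + 1)]


def slopes(knots):
    return [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(knots, knots[1:])]


def is_convex(knots):
    """A piecewise-linear function is convex iff its slopes are nondecreasing."""
    s = slopes(knots)
    return all(s0 <= s1 for s0, s1 in zip(s, s[1:]))


def lower_convex_hull(points):
    """Vertices of the lower convex hull of (x, y) points, left to right."""
    hull = []
    for px, py in sorted(points):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull


def fast_estimator(sample, lam, k):
    """Knots of L_n H_n if it is convex, otherwise knots of its GCM."""
    knots = interpolation_knots(sorted(sample), lam, k)
    return knots if is_convex(knots) else lower_convex_hull(knots)


random.seed(2)
sample = [random.weibullvariate(1.0, 2.0) for _ in range(400)]
estimate = fast_estimator(sample, lam=1.0, k=8)  # k plays the role of k_n
print(estimate)
```

The saving comes from working with $k_n + 1$ knots instead of all n observations when the hull has to be computed.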


APPENDIX 1


Proof of Proposition 1. Let x < y < z be any three equally spaced points in
$[\alpha_0, \lambda]$, such that $y - x = z - y = L/k_n$.
Let $p = S(y)/S(x)$, $q = S(z)/S(x)$; then $0 < q < p < 1$.


LEMMA 3. Let X be a binomial random variable $B(n, \mu)$ with $\mu \ge 1/2$. Then
for $t \ge 0$,


di - 4 5 
Pr{X — np > t} $e ~ Faw 


1 t? 
Pr{ X- nu < —t} S| 5 ayaa} 


PROOF. The proof follows from Hoeffding (1963). □

LEMMA 4. If F has IFR, then $q > 1/2$ for n sufficiently large.

PROOF.

$q = \dfrac{S(z)}{S(x)} = 1 - \dfrac{S(x) - S(z)}{S(x)}$

and

$\dfrac{S(x) - S(z)}{S(x)} = \dfrac{1}{S(x)}\int_x^z f(t)\,dt \le \dfrac{1}{\varepsilon_0}\int_x^z f(t)\,dt \le 2LM(k_n\varepsilon_0)^{-1}$ (since $f(t) \le r(t) \le M$)

$< 1/2$ for large n.

Hence $q > 1/2$ for n large enough. □


LEMMA 5. If F has IFR and satisfies assumption (A), then
$S^2(y) - S(x)S(z) \ge cL^2k_n^{-2}S(x)S(z)$.


PROOF.

$H(z) - H(y) = \int_0^{L/k_n} H'(y + t)\,dt.$

Similarly,

$H(y) - H(x) = \int_0^{L/k_n} H'(x + t)\,dt.$

Using these and the existence of c,

$\Delta = [H(z) - H(y)] - [H(y) - H(x)] = \int_0^{L/k_n} [H'(y + t) - H'(x + t)]\,dt \ge \int_0^{L/k_n} c(y - x)\,dt = cL^2k_n^{-2}.$

On the other hand,

$\Delta = \log\dfrac{S(y)}{S(z)} - \log\dfrac{S(x)}{S(y)} = \log\dfrac{S^2(y)}{S(x)S(z)}.$

Hence,

$S^2(y)[S(x)S(z)]^{-1} \ge \exp[cL^2k_n^{-2}].$

This implies

$S^2(y) - S(x)S(z) \ge S(x)S(z)[\exp(cL^2k_n^{-2}) - 1] \ge S(x)S(z)(cL^2k_n^{-2}).$ □


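PROOF OF PROPOSITION 1. Since $L_nH_n$ is linear on each of the $k_n$ equal
length intervals $[a_j^n, a_{j+1}^n]$, $j = 0, \ldots, k_n - 1$,

$A_n = \bigcap_{j=0}^{k_n-2} \{H_n(a_{j+1}^n) - H_n(a_j^n) \le H_n(a_{j+2}^n) - H_n(a_{j+1}^n)\}$

$\quad = \bigcap_{j=0}^{k_n-2} \{S_n(a_j^n)S_n(a_{j+2}^n) \le S_n^2(a_{j+1}^n)\} = \bigcap_{j=0}^{k_n-2} B_{n,j},$

where $B_{n,j}$ is the event in the above bracket.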
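The passage from increments of $H_n$ to products of $S_n$ in the display above can be spelled out as follows, assuming (as in the earlier sections) that $H_n = -\log S_n$ and that $S_n > 0$ at the three knots involved:

```latex
\begin{align*}
H_n(a_{j+1}^n) - H_n(a_j^n) \le H_n(a_{j+2}^n) - H_n(a_{j+1}^n)
&\iff 2H_n(a_{j+1}^n) \le H_n(a_j^n) + H_n(a_{j+2}^n) \\
&\iff -2\log S_n(a_{j+1}^n) \le -\log S_n(a_j^n) - \log S_n(a_{j+2}^n) \\
&\iff S_n(a_j^n)\,S_n(a_{j+2}^n) \le S_n^2(a_{j+1}^n).
\end{align*}
```

Because the knots are equally spaced, comparing successive increments is the same as comparing successive slopes, so the intersection of these events is exactly the event that the piecewise-linear interpolant is convex.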

Fix some j and let $x = a_j^n$, $y = a_{j+1}^n$, $z = a_{j+2}^n$. Let $B_n(t) = nS_n(t)$ for t in
$[\alpha_0, \lambda]$. Then $B_n(t)$ is distributed as a binomial random variable $B(n, S(t))$.

Given $B_n(x) = N$, $B_n(y)$ has binomial distribution $B(N, p)$ where
$p = S(y)/S(x)$ as defined above. Given $B_n(x) = N$, $B_n(z)$ has binomial distribution
$B(N, q)$ where $q = S(z)/S(x)$.

By Lemma 4, for n large enough, $p > q > 1/2$.

Let $U = Np - B_n(y)$ and $V = B_n(z) - Nq$. Let $B_{n,j}^c$ be the complement of
$B_{n,j}$. For n large enough, consider the conditional probability

$$
\begin{aligned}
\Pr\{B_{n,j}^c \mid B_n(x) = N\} &= \Pr\{S_n^2(y) < S_n(x)S_n(z) \mid B_n(x) = N\} \\
&= \Pr\{B_n^2(y) < B_n(x)B_n(z) \mid B_n(x) = N\} \\
&= \Pr\{(Np - U)^2 < N(Nq + V) \mid B_n(x) = N\} \\
&\le \Pr\{(Np)^2 - 2UNp < N^2q + NV \mid B_n(x) = N\} \\
&= \Pr\{N(p^2 - q) < V + 2Up \mid B_n(x) = N\} \\
&\le \Pr\{N(p^2 - q) < V + 2U \mid B_n(x) = N\}. \qquad \text{(A1.1)}
\end{aligned}
$$

Let $t = \frac{1}{3}N(p^2 - q)$. Since F has IFR,

$\Delta = [H(z) - H(y)] - [H(y) - H(x)] \ge 0.$

But

$\Delta = \log\dfrac{S^2(y)}{S(x)S(z)};$

hence $S^2(y) \ge S(x)S(z)$ and $p^2 \ge q$. This implies $t \ge 0$. Also $p^2 - q < 1 - q$;
hence $0 \le t < \frac{1}{3}N(1 - q)$.

Putting t into our calculation (A1.1), we have

$\Pr\{3t < V + 2U \mid B_n(x) = N\} \le \Pr\{V > t \text{ or } U > t \mid B_n(x) = N\}$
1 t? 1 t? 
aE h EEZ team) 


1 t? 1 t? 
<f- F yag] + | TEER cent 





; 1 t? 
<P ZINO- a) 
N( p°- q) 
A- q) J 
Let $\alpha = (p^2 - q)^2/(24(1 - q))$. Then $0 < \alpha < 1/24$. We have proved so far
that, for n large enough,


(A1.2) $\Pr\{B_{n,j}^c \mid B_n(x) = N\} \le 2\exp\{-\alpha N\}.$


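Now

$$
\begin{aligned}
\Pr\{B_{n,j}^c\} &= E\{\Pr[S_n^2(y) < S_n(x)S_n(z) \mid B_n(x)]\} \\
&\le E\{2\exp[-\alpha B_n(x)]\} \quad \text{(from (A1.2))} \\
&= 2[1 - S(x) + S(x)e^{-\alpha}]^n \quad \text{(since } B_n(x) \text{ is distributed as } B(n, S(x))\text{)} \\
&= 2[1 - S(x)(1 - e^{-\alpha})]^n \\
&\le 2\exp[-nS(x)(1 - e^{-\alpha})] \quad \text{(because } 1 - u \le e^{-u} \text{ for } 0 \le u \le 1\text{)} \qquad \text{(A1.3)} \\
&\le 2\exp\left[-\tfrac{1}{2}n\alpha S(x)\right] \quad \text{(since } \alpha < 1\text{)}.
\end{aligned}
$$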


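Consider

$$
\begin{aligned}
\alpha S(x) = \frac{(p^2 - q)^2 S(x)}{24(1 - q)} &= \frac{[S^2(y) - S(x)S(z)]^2}{24[S(x) - S(z)]S^2(x)} \\
&\ge \frac{[L^2cS(x)S(z)k_n^{-2}]^2}{24[F(z) - F(x)]S^2(x)} \\
&\ge \frac{[\varepsilon_0L^2c]^2}{24k_n^4[F(z) - F(x)]} \quad \text{(since } S(x) \ge \varepsilon_0\text{)}.
\end{aligned}
$$

Now

$F(z) - F(x) = \int_x^z f(t)\,dt \le 2LMk_n^{-1}.$

Hence

(A1.4) $\alpha S(x) \ge \dfrac{L^3[\varepsilon_0c]^2}{48Mk_n^3}.$

From (A1.3) and (A1.4)

$\Pr\{B_{n,j}^c\} \le 2\exp\left[-\dfrac{nL^3[\varepsilon_0c]^2}{96Mk_n^3}\right].$

Hence

$1 - \Pr(A_n) = \Pr\left\{\bigcup_{j=0}^{k_n-2} B_{n,j}^c\right\} \le 2k_n\exp\left[-\dfrac{nL^3[\varepsilon_0c]^2}{96Mk_n^3}\right].$ □
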
APPENDIX 2 
Proof of Proposition 2.


LEMMA 6. Let $X_n(t) = n^{1/2}[H_n(t) - H(t)]$. Given $\varepsilon > 0$, there exist $\delta > 0$
and an integer $N_\varepsilon$ such that for all $n \ge N_\varepsilon$,


$\Pr[\sup\{|X_n(t) - X_n(s)|; \text{ for all } |t - s| < \delta \text{ and } t, s \text{ in } [\alpha_0, \lambda]\} > \varepsilon] < \varepsilon.$


PROOF. It can be checked by standard procedures that the process $X_n(t)$
converges weakly to a Gaussian process Z(t) with continuous paths in $D[\alpha_0, \lambda]$,
where $EZ(t) = 0$, $\mathrm{Var}\,Z(t) = F(t)[1 - F(t)]^{-1}$, and $\mathrm{Cov}(Z(s), Z(t)) =
F(s)[1 - F(s)]^{-1}$ if $s \le t$. Since Z(t) has continuous paths on $D[\alpha_0, \lambda]$ with
probability 1, the result follows from the tightness conditions on Skorokhod
topology. □


PROOF OF PROPOSITION 2. Given $\varepsilon > 0$, since $k_n \to \infty$ there exists $N_\varepsilon$ such
that for all $n \ge N_\varepsilon$, $Lk_n^{-1} < \delta$ for the $\delta$ in Lemma 6.
Let $S_nH$ be the piecewise shifted H. That is,

$S_nH(t) = H(t) + [H_n(a_j^n) - H(a_j^n)]$ for $a_j^n \le t < a_{j+1}^n$.

Then

$S_nH(a_j^n) = H_n(a_j^n)$ for $j = 0, \ldots, k_n$.

Consider

$n^{1/2}(H_n - S_nH)(t) = n^{1/2}\{[H_n(t) - H(t)] - [H_n(a_j^n) - H(a_j^n)]\} = X_n(t) - X_n(a_j^n)$ for $a_j^n \le t < a_{j+1}^n$.

Then

$n^{1/2}\|H_n - S_nH\|_\lambda = \sup_{0 \le j \le k_n - 1}\ \sup_{a_j^n \le t < a_{j+1}^n} |X_n(t) - X_n(a_j^n)| \le \sup_{|t - s| < \delta} |X_n(t) - X_n(s)|$ if $n \ge N_\varepsilon$.

Lemma 6 now implies

(A2.1) $n^{1/2}\|H_n - S_nH\|_\lambda \to 0$ in probability.

Recall the definition of $L_n$, the linear interpolation in Section 2:

$L_nH_n(a_j^n) = H_n(a_j^n) = S_nH(a_j^n) = L_n(S_nH)(a_j^n)$, $j = 0, \ldots, k_n$.

Since $L_n$ is the piecewise linear interpolation process,

$L_nH_n(t) = L_n(S_nH)(t)$ for all t.

We have

$$
\begin{aligned}
(H_n - L_nH_n)(t) &= (H_n - S_nH)(t) + (S_nH - L_nS_nH)(t) + (L_nS_nH - L_nH_n)(t) \\
&= (H_n - S_nH)(t) + (S_nH - L_nS_nH)(t) \\
&= (H_n - S_nH)(t) + (H - L_nH)(t),
\end{aligned}
$$

since $S_nH$ is the piecewise shifted H.
If $n^{1/2}\|H - L_nH\|_\lambda \to 0$ in probability, by (A2.1) this will imply
$n^{1/2}\|H_n - L_nH_n\|_\lambda \to 0$ in probability. □


APPENDIX 3 
Proof of Lemma 2. 


PROOF. From the definition of IFR distributions, $H^{-1}$ exists provided one
defines $H^{-1}(0) = \alpha_0$. Thus $H^{-1}(y)$ is concave on $[0, H(\lambda)]$ and differentiable
except for countably many points. Let $N_0$ be the exceptional set where $H^{-1}(y)$ is
not differentiable. Let J be the complement of $N_0$, that is, $J = [0, H(\lambda)] - N_0$
and f be the derivative of $H^{-1}$ on J. Since $H'$ is bounded on $H^{-1}(J)$, we have
$f(x) \ge \delta$ for some $\delta > 0$ and all x in J, and $f(u) = [H'(H^{-1}(u))]^{-1} \ge
[H'(H^{-1}(v))]^{-1} = f(v)$ for any u < v on J. Also,

$$
\begin{aligned}
f(u) - f(v) &= [H'(H^{-1}(u))]^{-1} - [H'(H^{-1}(v))]^{-1} \\
&= \frac{H'(H^{-1}(v)) - H'(H^{-1}(u))}{H'(H^{-1}(u))\,H'(H^{-1}(v))} \\
&\ge c[H^{-1}(v) - H^{-1}(u)]\,f(u)f(v) \\
&\ge c[(v - u)f(v)]\,f(u)f(v).
\end{aligned}
$$

The last step is due to the fact that f is decreasing on J. Therefore,

(A3.1) $\dfrac{f(u) - f(v)}{(v - u)f(v)} \ge cf(u)f(v) \ge c\delta^2$

for any pair u < v in J.

We are now looking for a decreasing linear function $g(u) = b - au$ with
$g(u) > 0$ on $[0, H(\lambda)]$ so that the ratio $f(u)/g(u)$ will be decreasing on J. This
can be done in the following way.

Given $\delta > 0$, there exists $a > 0$ such that $b - aH(\lambda) > 0$ and
$0 < a \le c\delta^2[b - aH(\lambda)]$. From (A3.1)

$a \le \dfrac{f(u) - f(v)}{(v - u)f(v)}[b - aH(\lambda)] \le \dfrac{f(u) - f(v)}{(v - u)f(v)}[b - av]$

for all u < v in J. Hence

$[f(u) - f(v)][b - av] \ge af(v)(v - u),$

and this implies

$\dfrac{f(u)}{b - au} \ge \dfrac{f(v)}{b - av}$ for u < v in J.
Hence, we have found g such that $f(u)/g(u)$ is decreasing on J and
$g(u) - g(v) = a(v - u)$. Let $G(y) = \alpha_0 + \int_0^y g(u)\,du$. Then G is concave and $G'' = -a$
on $[0, H(\lambda)]$. Let $K(x) = G^{-1}(x)$ on $[\alpha_0, \bar{\lambda}]$, where $[\alpha_0, \bar{\lambda}]$ is the range of G on
$[0, H(\lambda)]$. Then K is convex and $K''$ exists on $[\alpha_0, \bar{\lambda}]$. Moreover

$K'(x) = [g(G^{-1}(x))]^{-1}, \qquad K''(x) = -\dfrac{G''(G^{-1}(x))}{[G'(G^{-1}(x))]^3} = a[g(G^{-1}(x))]^{-3}.$

This means $K''(x) \ge ab^{-3} > 0$ for all x in $[\alpha_0, \bar{\lambda}]$ and $K''$ has a finite upper
bound because g is bounded away from zero. Part (1) has thus been proved.

Let $\phi(x) = H^{-1}(K(x))$. From the way K is defined, $\phi$ is a function from
$[\alpha_0, \bar{\lambda}]$ to $[\alpha_0, \lambda]$. Note that $\phi'(x)$ will exist if $f(K(x))$ exists, and at any point x
where $f(K(x))$ exists,

$\phi'(x) = \dfrac{K'(x)}{H'[H^{-1}(K(x))]} = \dfrac{f(K(x))}{g(K(x))}.$

Hence, $\phi'(x)$ exists and is a decreasing function on $[\alpha_0, \bar{\lambda}]$ except for countably
many points. This proves the concavity of $\phi$ and hence part (2). □


APPENDIX 4 
Proof of Proposition 4. 


PROOF. The proof will be separated into two parts.

(i) This is the case where F has a jump at $\alpha_1$ with $1 - F(\alpha_1-) > 0$. In the
proof of Lemma 6 in Appendix 2, we showed that $\sqrt{n}(H_n(x) - H(x))$ converges
in distribution (in the Prohorov sense) to a Gaussian process Z(x). Let $\|\cdot\|_{\alpha_1}$
denote the supremum norm of a function on $[\alpha_0, \alpha_1)$. By Donsker's invariance
principle (1952), $n^{1/2}\|H_n - H\|_{\alpha_1}$ is bounded in probability. For any $\varepsilon > 0$, there
exists N large enough such that $\Pr\{\|H_n - H\|_{\alpha_1} < \varepsilon\} > 1 - \varepsilon$ for all $n \ge N$.
Let $E_n$ be the event that $\|H_n - H\|_{\alpha_1} < \varepsilon$. Then we have $P(E_n) > 1 - \varepsilon$ for
$n \ge N$. Let b be a fixed point between a and $\lambda$. Let m be the right-hand
derivative of $C_n^\lambda(x)$ at x = a. Under $E_n$, $H_n(x)$ lies entirely within the band
$\{H(x) - \varepsilon, H(x) + \varepsilon\}$. Hence

$m \le \dfrac{[H(b) + \varepsilon] - [H(a) - \varepsilon]}{b - a} = \dfrac{H(b) - H(a) + 2\varepsilon}{b - a}.$

On the other hand, for any $y \ge \lambda$,

$$
\begin{aligned}
\frac{H_n(y) - H_n(a)}{y - a} &\ge \frac{H(y) - H(a) - 2\varepsilon}{y - a} \\
&\ge \frac{H(y) - H(a)}{y - a} - \frac{2\varepsilon}{\lambda - a} \\
&\ge \frac{H(\lambda) - H(a)}{\lambda - a} - \frac{2\varepsilon}{\lambda - a}.
\end{aligned}
$$

If $\varepsilon$ is small enough, by strict convexity of H we have

$[H(\lambda) - H(a)]/(\lambda - a) - 2\varepsilon/(\lambda - a) > [H(b) - H(a) + 2\varepsilon]/(b - a).$

Hence, $[H_n(y) - H_n(a)]/(y - a) > m$ for any $y \ge \lambda$. This means that for $y \ge \lambda$,
$H_n(y)$ lies above the line passing through $(a, H_n(a))$ with slope m. Since
$C_n^\lambda(a) \le H_n(a)$, $H_n(y)$ also lies above the line passing through $(a, C_n^\lambda(a))$ with
slope m. Thus for all $y \ge \lambda$, $H_n(y)$ does not affect $C_n^*(a)$, hence does not affect
$C_n^*(x)$ for all $x \le a$. We have shown that under $E_n$, $C_n^\lambda(x) = C_n^*(x)$ for all $x \le a$.
Thus, $\Pr\{C_n^\lambda(x) = C_n^*(x) \text{ for all } x \le a\} \to 1$.

(ii) This is the case where F is continuous. If X is distributed as F, then F(X)
is distributed as U(0,1), the uniform distribution on [0,1]. Also its survival
function $S(X) = 1 - F(X)$ is distributed as U(0,1). From Karlin (1972), page 250, for
$a > 1$, $\Pr\{S_n(x) \le aS(x) \text{ for all } x\} = 1 - 1/a$ for all n. This implies
$\Pr\{H_n(x) \ge -\log a + H(x) \text{ for all } x\} = 1 - 1/a$ for all n. For any $\varepsilon > 0$, let
$b = \log a = -\log\varepsilon$. We have

(A4.1) $\Pr\{H_n(x) \ge H(x) - b\} = 1 - \varepsilon.$

As in (i), $n^{1/2}\|H_n - H\|_\lambda$ is bounded in probability. Hence, given any $\delta > 0$, there
exists $N_\delta$ large enough such that

$\Pr\{\|H_n - H\|_\lambda < \delta\} > 1 - \varepsilon$ for $n \ge N_\delta$.

Let m be the right-hand derivative of $C_n^\lambda(x)$ at x = a. We want to show that
with high probability, for x close to $\alpha_1$, and n large enough, $H_n(x)$ is above the
line passing through $(a, C_n^\lambda(a))$ with slope m. Hence, what happens in the
right-hand tail does not affect $C_n^*(x)$ for $x \le a$. To achieve this goal, let $E_n$ be
the event that $\|H_n - H\|_\lambda < \delta$. Under $E_n$, $H_n$ lies within the band
$\{H(x) - \delta, H(x) + \delta\}$. Hence, $m \le [H(\lambda) - H(a) + 2\delta]/(\lambda - a)$ and $H_n(a) \le H(a) + \delta$.
From (A4.1), it is sufficient to show that
$H(x) - b \ge [H(a) + \delta] + [H(\lambda) - H(a) + 2\delta](\lambda - a)^{-1}(x - a)$ for x close to $\alpha_1$, or equivalently,
$[H(x) - H(a)]/(x - a) \ge (b + \delta)/(x - a) + [H(\lambda) - H(a) + 2\delta]/(\lambda - a)$ for x close
to $\alpha_1$.

If $\alpha_1 < \infty$, the ratio $[H(x) - H(a)]/(x - a)$ increases to $\infty$ as x tends to
$\alpha_1(F)$. Hence there exists $x_0$ such that for $x \ge x_0$,
$[H(x) - H(a)]/(x - a) \ge (b + \delta)/(x - a) + [H(\lambda) - H(a) + 2\delta]/(\lambda - a)$.

If $\alpha_1 = \infty$, this means $(b + \delta)/(x - a) \to 0$ as $x \to \alpha_1$. Since
$[H(x) - H(a)]/(x - a) > [H(\lambda) - H(a)]/(\lambda - a)$ for $x > \lambda$, with $\delta$ small enough, there
exists $x_0$ such that for $x \ge x_0$,
$[H(x) - H(a)]/(x - a) \ge (b + \delta)/(x - a) + [H(\lambda) - H(a) + 2\delta]/(\lambda - a)$.

So far, we have shown that given $\varepsilon > 0$, there exist $x_0(\varepsilon)$ and $N_\varepsilon > 0$ such that
for $n \ge N_\varepsilon$

(A4.2) $\Pr\{\text{for all } y \ge x_0,\ H_n(y) \text{ does not affect } C_n^*(x) \text{ for all } x \le a\} \ge 1 - 2\varepsilon.$

For this $x_0$, $n^{1/2}\|H_n - H\|_{x_0}$ is bounded in probability. Using the same argument
as in (i), one can show that there exists $N_0 > 0$ such that for $n \ge N_0$,

(A4.3) $\Pr\{C_n^\lambda(x) = C_n^{x_0}(x) \text{ for all } x \le a\} > 1 - \varepsilon.$

Combining (A4.2) and (A4.3) we have

$\Pr\{C_n^\lambda(x) = C_n^*(x) \text{ for all } x \le a\} > 1 - 3\varepsilon$

for n large enough. This proves the result for (ii). □
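The distribution-free identity quoted from Karlin (1972) in part (ii), $\Pr\{S_n(x) \le aS(x) \text{ for all } x\} = 1 - 1/a$, can be checked by simulation. The sketch below is a Monte Carlo illustration only; it assumes a continuous F, so that the $S(X_i)$ are uniform, and the sample size, the value of a, the number of replications and the seed are arbitrary choices.

```python
import random


def sup_ratio(n, rng):
    """sup_x S_n(x) / S(x) for a continuous F, computed from n uniform variates.

    S(X_i) is uniform on (0, 1]; with ordered values U_(1) < ... < U_(n),
    the supremum equals max_i i / (n * U_(i)).
    """
    u = sorted(1.0 - rng.random() for _ in range(n))  # values in (0, 1]
    return max((i + 1) / (n * ui) for i, ui in enumerate(u))


def coverage(n, a, reps, seed=0):
    """Monte Carlo estimate of Pr{S_n(x) <= a S(x) for all x}."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(reps) if sup_ratio(n, rng) <= a)
    return hits / reps


estimate = coverage(n=25, a=2.0, reps=4000)
print(estimate, "compared with", 1 - 1 / 2.0)
```

With $a = 1/\varepsilon$ this is the bound that produces (A4.1), and the estimate should not depend on n beyond simulation noise.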


Acknowledgments. This paper is dedicated to the memory of the late 
Professor Jack Kiefer who first introduced and guided me to research in this area. 
I am grateful to Professor Lucien Le Cam for many inspiring discussions during 
the course of this work. The referee and the Associate Editor have made many 
helpful comments and suggestions. 


REFERENCES 


BARLOW, R. E., BARTHOLOMEW, D. J., BREMNER, J. M., and BRUNK, H. D. (1972). Statistical
Inference Under Order Restrictions. Wiley, New York.

BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley, New York.

DONSKER, M. D. (1952). Justification and extension of Doob's heuristic approach to the
Kolmogorov-Smirnov theorems. Ann. Math. Statist. 23 277-281.

DVORETZKY, A., KIEFER, J., and WOLFOWITZ, J. (1956). Asymptotic minimax character of the sample
distribution function and of the classical multinomial estimator. Ann. Math. Statist. 27
642-669.

GRENANDER, U. (1956). On the theory of mortality measurement. Part II. Scand. Actuar. J. 39
125-153.

GROENEBOOM, P. (1983). The concave majorant of Brownian motion. Ann. Probab. 11 1016-1027.

HOEFFDING, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer.
Statist. Assoc. 58 13-30.

KARLIN, S. (1972). A First Course in Stochastic Processes. Academic, New York.

KIEFER, J. and WOLFOWITZ, J. (1976). Asymptotically minimax estimation of concave and convex
distribution functions. Z. Wahrsch. verw. Gebiete 34 73-85.

LEURGANS, S. (1982). Asymptotic distributions of slope of greatest-convex-minorant estimators.
Ann. Statist. 10 287-296.

MARSHALL, A. W. (1970). Discussion of Barlow and van Zwet's papers. In Nonparametric
Techniques in Statistical Inference (M. L. Puri, ed.) 175-176. Cambridge Univ. Press.

MARSHALL, A. W. and PROSCHAN, F. (1965). Maximum likelihood estimation for distributions with
monotone failure rates. Ann. Math. Statist. 36 69-77.

MILLAR, P. W. (1979). Asymptotic minimax theorems for the sample distribution function. Z.
Wahrsch. verw. Gebiete 48 233-252.

PRAKASA RAO, B. L. S. (1970). Estimation for distributions with monotone failure rate. Ann. Math.
Statist. 41 507-519.

WANG, J.-L. (1982). Asymptotically minimax estimators for distributions with increasing failure rate.
Ph.D. dissertation. Univ. California, Berkeley.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF CALIFORNIA 
Davis, CALIFORNIA 95616 


The Annals of Statistics 
1986, Vol. 14, No. 3, 1132-1138


ESTIMATION OF A UNIMODAL DISTRIBUTION FUNCTION 


By Shaw-Hwa Lo
Rutgers University


This paper deals with the problem of efficiently estimating (asymptoti- 
cally minimax) a distribution function when essentially nothing is known 
about it except that it is unimodal. 

The sample distribution function $F_n$ is shown to be asymptotically minimax
among the family $\mathcal{L}$ of all unimodal distribution functions. Since $F_n$ does
not belong to this family, estimators belonging to this family are constructed
and are shown to be asymptotically minimax relative to the collection of
subfamilies of $\mathcal{L}$.


1. Introduction. In their pioneering paper, Dvoretzky, Kiefer, and Wolfowitz
(1956) proved that the sample distribution function $F_n$ is asymptotically
minimax (a.m.) in the collection of all continuous distribution functions (d.f.'s).
After 20 years, Kiefer and Wolfowitz (1976), motivated by reliability theory (see
Barlow et al. (1972)), reopened the problem and proved that the sample d.f. is
still a.m. either in the class of all concave d.f.'s or in the class of all convex d.f.'s.
Furthermore, in the same paper, by using Marshall's lemma (1970) they
immediately got that $C_n$ (the least concave majorant or the greatest convex
minorant of $F_n$), which is concave (convex) and hence suitable to be used as an
estimator, is also a.m. for estimating F. In the same paper, Kiefer and Wolfowitz
noted some interesting open problems which are related to reliability theory.
Two of them are estimating increasing (decreasing) failure rate distributions and
estimating unimodal distributions. The first problem was later considered by
Millar (1979); he showed that the sample d.f. is still a.m. among the class of all
increasing (decreasing) failure rate distribution functions. Wang (1982) showed
that under some additional assumptions it is possible to find an estimator $C_n$
which is a.m. such that $C_n$ itself is in the class of increasing failure rate
distributions. The present paper considers the second problem, i.e., estimating a
unimodal distribution function. In the next section, the author gives the
definition of a unimodal distribution function and proves that the sample d.f. $F_n$ is
still a.m. among the family $\mathcal{L}$ of all unimodal distribution functions (Theorem
2.1). Since $F_n$ does not belong to this family $\mathcal{L}$, estimators ($\hat{F}_n$) belonging to this
family are constructed and are shown to be $n^{-1/2}$-close (in supremum norm) to the
sample d.f. uniformly among the subfamily $\mathcal{L}^*(\delta_0, M, K)$ of $\mathcal{L}$ (see (2.4)). A
slightly weaker concept "a.m. relative to a family" is defined (see (2.5)), and the
estimator $\hat{F}_n$ (as well as $F_n$) is proved to be a.m. relative to the family
$\{\mathcal{L}^*(\delta_0, M, K)\}$ (Theorem 2.2). Section 2 contains our main results. All the proofs
are given in Section 3.


Received May 1983; revised November 1985. 
AMS 1980 subject classifications. 62E20, 62G20. 
Key words and phrases. Unimodal distribution function, asymptotically minimax. 




2. Main results. A function f is unimodal at $\theta$ if and only if f is
nondecreasing at x for $x < \theta$ and f is nonincreasing at x for $x > \theta$. We consider the
collection $\mathcal{L}$ as follows:


$\mathcal{L}$ = {F(x); F(x) is an absolutely continuous d.f.
with a unimodal density function f(x)}.


Let B denote the collection of all cumulative distribution functions on the real
line. In this paper, we consider the loss function for a sample size n as
$L_n: B \times B \to R^+ = [0, \infty)$ with $L_n(F, G) = l(n^{1/2}(F - G))$, where $l$ is subconvex
with the properties that $E\,l(n^{1/2}(F_n - F))$ converges to $E\,l(W^0(F))$, and $W^0(F)$
is the Brownian bridge process composed with F. These assumptions are
essentially the same as the ones used by Millar (1979), and also cover the classical loss
functions such as Kolmogorov distance and von Mises distance used by Kiefer
and Wolfowitz (1956, 1976).

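An estimator $\phi_n$ of F is a.m. in $\mathcal{L}$ if


(2.1) $\lim_{n \to \infty} \dfrac{\sup_{F \in \mathcal{L}} E_F\, l(n^{1/2}(\phi_n - F))}{\inf_b \sup_{F \in \mathcal{L}} E_F\{\int l(n^{1/2}(y - F))\, b(x_{(n)}, dy)\}} = 1,$

where $x_{(n)}$ denotes $(x_1, x_2, \ldots, x_n)$ and b runs over all randomized procedures.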


One can use Millar’s (1979) sufficient conditions to prove the following theo- 
rem. 


THEOREM 2.1. Let $L_n$ be described as above. Then the sample d.f. $F_n$ is
a.m. (in the sense of (2.1)) among the collection $\mathcal{L}$.


The proof of this theorem is deferred and will be given in the next section.
Since the sample d.f. may not belong to $\mathcal{L}$, it is not a proper estimator to use in
some situations. Therefore, we are going to construct some estimators $\hat{F}_n$
(modifications of $F_n$) which belong to $\mathcal{L}$ and are close to $F_n$. The constructions involve the
estimation of the mode. The problems of estimating a mode have been studied by
Chernoff (1964), Grenander (1965), and Venter (1967). The following proposition
is proved in Venter (1967). The rates of convergence have been shown to be the
best possible (see Hasminskii (1974)).


PROPOSITION 1 (Venter, 1967). Suppose f(x) has a unique mode at $\theta$. Let
$\delta > 0$ and write

$a_1(\delta) = \min\{f(x);\ \theta - \delta \le x \le \theta + \delta\},$
$a_2(\delta) = \max\{f(x);\ x \le \theta - 2\delta,\ \theta + 2\delta \le x\},$
$a(\delta) = a_1(\delta)/a_2(\delta).$

Suppose the following condition holds:

(2.2) For all $\delta$ small enough $a(\delta) \ge 1 + \rho\delta^k$,

where $\rho$ and k are positive constants.

Then one can find proper estimators $\hat{\theta}_n$ such that $\hat{\theta}_n = \theta + o(\delta_n)$ w.p. 1,
where


` 8, = n0t logn) ifk >t 
(2.3) Coen) ? 


= n~V%°(log n)" ifk <i, 


Note that the speed of convergence of $\hat{\theta}_n$ to $\theta$ depends on the knowledge of
smoothness of f near $\theta$. Consider the following subcollections of $\mathcal{L}$:

Let $\delta_0$ be a small positive number, and let K, M be two positive constants.
Define


(2.4) $\mathcal{L}^*(\delta_0, M, K)$ = {F; $F \in \mathcal{L}$ and there exists a $\rho^* \le M$ such that
$1 + \rho^*\delta^K \ge a(\delta) \ge 1 + \rho\delta^K$ for all $\delta \le \delta_0$}.


It follows from the results in Venter (1967) that among the subcollection
$\mathcal{L}^*(\delta_0, M, K)$, $\hat{\theta}_n$ has the property that the speed of convergence of $\hat{\theta}_n$ to $\theta$ is
given by (2.3) uniformly in $\mathcal{L}^*(\delta_0, M, K)$.

Consider the estimator $\hat{F}_n(x)$ of F(x) as follows:

Let $\hat{F}_n$ be constructed as the least concave majorant (LCM) of $F_n(x)$ on $x \ge \hat{\theta}_n$,
and the greatest convex minorant (GCM) on $x < \hat{\theta}_n$. It is easy to construct a
modified version, say $\tilde{F}_n$, of $\hat{F}_n$ such that $\|\hat{F}_n - \tilde{F}_n\| \le 1/n$ w.p. 1, and $\tilde{F}_n$ is in $\mathcal{L}$
and has $\hat{\theta}_n$ as its unique mode.
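The construction just described can be sketched in Python as follows. The sketch computes only the unmodified estimator: the GCM of the empirical d.f. to the left of a given mode estimate and the LCM to its right, each as a convex hull of points on the graph of $F_n$. The further modification with error at most 1/n is not implemented. The mode estimate is passed in as `mode_hat` (a hypothetical argument; Venter's estimator is not implemented here), and the sample is assumed to have no ties and not to contain `mode_hat` itself.

```python
import bisect
import random


def hull(points, lower=True):
    """Lower (or upper) convex hull of points with distinct x, left to right."""
    sign = 1.0 if lower else -1.0
    out = []
    for px, py in sorted(points):
        while len(out) >= 2:
            (x1, y1), (x2, y2) = out[-2], out[-1]
            if sign * ((x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)) <= 0:
                out.pop()
            else:
                break
        out.append((px, py))
    return out


def slopes(knots):
    return [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(knots, knots[1:])]


def unimodal_cdf_knots(sample, mode_hat):
    """Knots of the GCM of F_n left of mode_hat and of the LCM right of it."""
    xs = sorted(sample)
    n = len(xs)
    k = bisect.bisect_right(xs, mode_hat)  # number of observations <= mode_hat
    # Left of the mode: left limits F_n(x-) at the order statistics, then F_n(mode_hat).
    left = {x: i / n for i, x in enumerate(xs[:k])}
    left[mode_hat] = k / n
    # Right of the mode: F_n(mode_hat), then F_n at the later order statistics.
    right = {mode_hat: k / n}
    right.update({x: (k + i + 1) / n for i, x in enumerate(xs[k:])})
    return hull(left.items(), lower=True), hull(right.items(), lower=False)


random.seed(3)
sample = [random.gauss(0.0, 1.0) for _ in range(300)]
left, right = unimodal_cdf_knots(sample, mode_hat=0.0)  # 0.0 stands in for an estimated mode
print(len(left), len(right))
```

The two pieces meet at the mode estimate, are convex to its left and concave to its right, so the resulting d.f. has a nondecreasing slope before the mode and a nonincreasing slope after it.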

The following theorem tells us that the difference $n^{1/2}\|\hat{F}_n - F\|$ is essentially
no bigger than $n^{1/2}\|F_n - F\|$ in each subcollection $\mathcal{L}^*(\delta_0, M, K)$, and hence
yields a slightly weaker a.m. result as follows:

An estimator $\phi_n$ is a.m. relative to the family $\{\mathcal{L}^*(\delta_0, M, K);\ \delta_0, M, K > 0\}$ if


(2.5) $\sup_{(\delta_0, M, K)} \lim_{n \to \infty} \dfrac{\sup_{F \in \mathcal{L}^*(\delta_0, M, K)} E_F\, l(n^{1/2}(\phi_n - F))}{\inf_b \sup_{F \in \mathcal{L}^*(\delta_0, M, K)} E_F\{\int l[n^{1/2}(y - F)]\, b(x_{(n)}, dy)\}} = 1.$


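THEOREM 2.2. For every $\mathcal{L}^*(\delta_0, M, K)$ described as above,

(2.6) $\sqrt{n}\,\|\hat{F}_n - F\|_\infty \le \sqrt{n}\,\|F_n - F\|_\infty + o_p(1)$

uniformly in $F \in \mathcal{L}^*(\delta_0, M, K)$. Furthermore, $\hat{F}_n$ is a.m. relative to the family
$\{\mathcal{L}^*(\delta_0, M, K)\}$.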


REMARK 1. The first part of Theorem 2.2 does not imply that $\hat{F}_n$ is a.m.
among $\mathcal{L}^*(\delta_0, M, K)$ since the sample d.f. $F_n$ may not be a.m. among
$\mathcal{L}^*(\delta_0, M, K)$.


REMARK 2. From the proof (given in the next section) of the second part of
Theorem 2.2, one can show that (2.5) holds with fixed K = 2.


REMARK 3. Note that $\mathcal{L}^*(\delta_2, M, K) \subset \mathcal{L}^*(\delta_1, M, K)$ for $\delta_1 < \delta_2$ and for
every fixed M and K. Let $\mathcal{L}^*(M, K) = \bigcup_{\delta_0 > 0} \mathcal{L}^*(\delta_0, M, K)$. It can be shown
(see the proof of Theorem 2.2) that $F_n$ is a.m. among $\mathcal{L}^*(M, 2)$, for some M > 0,
but it is not clear at this moment whether $F_n$ is a.m. among $\mathcal{L}^*(M, K)$ for $K \ne 2$.


Before closing this section, we give an example.
Suppose f satisfies

(2.7) $f(x) = \gamma_0 - \gamma(x - \theta)^2 + o(|x - \theta|^2)$ as $x \to \theta$, for $\gamma_0, \gamma > 0$.

There exists a $\delta_0 < \gamma_0^{1/2}/(10\gamma^{1/2})$ such that the term
$|o(|x - \theta|^2)| \le (x - \theta)^2\min(\gamma/10, \gamma_0/10)$ if $|x - \theta| \le \delta_0$.
Therefore, if $|\Delta| \le \delta_0$, one can write
f(O+A) — yo~ (y— 0(d*)/0?) 3(y — 0( A?) /A?) A? 
IOF 2A) y—4(y—0(A2)/A)X a- 4(y — (A) 02) 








3(y — o( A?) /A?) A? 3( 2 )A? 27 
21+ MEENA] zi EZ) lg, 
Yo Yo 10 Yo 
On the other hand, 
2 
f(0 +A) B 1 3300 Y e 


r E E ni Gs 
f(0 + 24A) Yo — 76000 964 Yo 
Therefore, the corresponding d.f. $F(x) \in \mathcal{L}^*(\gamma_0^{1/2}/(10\gamma^{1/2}), M, 2)$ for any
$M \ge (3300/964)(\gamma/\gamma_0)$.
3. Proofs. 


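PROOF OF THEOREM 2.1. Take $K(x) = \Phi(x)$ to be the standard normal d.f.
with density $\varphi(x) = (1/\sqrt{2\pi})e^{-x^2/2}$. It is clear that $\Phi \in \mathcal{L}$. It suffices to show
that $\mathcal{L}$ is radially dense at $\Phi$ as Millar (1979) pointed out.

Consider the densities of the form

$\varphi(x; n^{-1/2}h(x)) = \varphi(x)(1 + n^{-1/2}h(x)).$

$\varphi(x; n^{-1/2}h(x))$ is a density if $\int_{-\infty}^{\infty} \varphi(x)h(x)\,dx = 0$ and $\sup_x |n^{-1/2}h(x)| < 1/2$. To
assure $\varphi(x; n^{-1/2}h)$ in $\mathcal{L}$, consider

$H_n = \{h;\ \int h(x)\varphi(x)\,dx = 0,\ \int h^2(x)\varphi(x)\,dx < \infty,\ \sup_x |n^{-1/2}h(x)| < 1/2,$
$\qquad h'(x) = 0 \text{ if } x \in [-\varepsilon_n, \varepsilon_n], \text{ and } |h'(x)| \le \tfrac{1}{2}n^{1/2}\varepsilon_n \text{ if } x \notin [-\varepsilon_n, \varepsilon_n]\},$

where $\{\varepsilon_n\}$ is a positive sequence tending to zero with $n^{1/2}\varepsilon_n \to \infty$ as $n \to \infty$.
Clearly, $\bigcup_{n=1}^{\infty} H_n$ is dense in $H(\Phi)$ where $H(\Phi)$ is defined in Millar (1979) as

$H(\Phi) = \{h:\ \int_{-\infty}^{\infty} h(x)\varphi(x)\,dx = 0 \text{ and } \int_{-\infty}^{\infty} h^2(x)\varphi(x)\,dx < \infty\}.$

Direct calculation of $\varphi'(x; n^{-1/2}h)$ shows that $\varphi(x; n^{-1/2}h)$ is unimodal, and
hence in $\mathcal{L}$ when $h \in H_n$. This shows that $\Phi$ is a radial cluster point, and the
theorem thus follows. □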


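We need some lemmas to prove Theorem 2.2. For any $F \in \mathcal{L}^*(\delta_0, M, K)$, let f
denote the density of F. Let $\underline{f}(\theta + \delta_n) = \inf\{f(\theta + x);\ |x| \le \delta_n\}$ for $\delta_n \le \delta_0$.


LEMMA 1. Assume $F \in \mathcal{L}^*(\delta_0, M, K)$. Then

(3.1) $A_n = \int_{\theta - \delta_n}^{\theta + \delta_n} f(t)\,dt - 2\delta_n\underline{f}(\theta + \delta_n) = o(n^{-1/2})$

uniformly in $\mathcal{L}^*(\delta_0, M, K)$, where $\delta_n$ is defined as in (2.3).

PROOF. First note that $2\delta_n\underline{f}(\theta + \delta_n) \le 1$; therefore, $\underline{f}(\theta + \delta_n) \le 1/(2\delta_n)$. (This
is true for all F in $\mathcal{L}^*(\delta_0, M, K)$.) By the definition of $\mathcal{L}^*(\delta_0, M, K)$, one can
write

(3.2) $\underline{f}(\theta + \delta_n/2^p) \le \underline{f}(\theta + \delta_n)\prod_{j=1}^{p} [1 + M(\delta_n/2^j)^K].$

Taking the log, we obtain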


p p 1 jk M2g2* P 1 2Jk 
log [T [1 + M(8,/2)*] < mos © (3) + AA 
Jol jel ym 


= ën p (say), 
since log(1 + x) < x + x?/2 if x > 0. Therefore, 





(3.3) 


P “oe 
TI [1 + m(8,/24)*] seme <ite,, 
j-i 


(En, p < 1 if 6, is small enough). We obtain 





(3.4) £(6 + 8,/2P) < #(6 + 8,)(1+L,, ,d¢), 
where 
Pfi\se M2gk P 14 \2Ik 
Ln =MY |3) =» (5) >L,<% apo. 
: pale 2 ja1\2 


This together with the fact that f(6 + 6,) < 1/26, implies f(@) < L,/26, for 
5,, < 5). This shows that the densities of &*(6), M, K ) are uniformly bounded. 
From (3.4), we have 


£(0 + 6,/2”) — ECO + 6,) < ECO + 6,)L,, Of 
< f(8)L,5" = O(1)87 


(3.5) 


uniformly in &*(4), M, K). 
From Proposition 1 and (3.5), 
A,, < O(1)840(5,) < 0( 54+") 
n~ +D/0+24) (Jog n)” if k > 
n-FtDA2(log n)*t*V* ifk< 
= o(n7/), o 




LEMMA 2. Under the assumptions of Lemma 1, let $f_n$ be any unimodal
density function which is identical with $f(x)$ outside $I_n = (\theta - \delta_n, \theta + \delta_n)$, and
let $F_n^*$ denote the distribution function of $f_n$. Then

(3.6)    $\sup_x |F_n^*(x) - F(x)| = o(n^{-1/2})$

uniformly in $\mathscr{F}^*(\delta_0, M, K)$.
Proor. It suffices to show that 
sup |F,*(x) — F(x)| = o(a?) 
xel, 
uniformly in #*(5),,M,K). F* unimodal implies that f(x) = f(@ + 6,) if 


x € I. Since F* isa d.f., 
0+ 


AO, dt — £(0 + ô, )26, as f 
6-8, (i 
A,, is defined as in (3.1). Therefore, for x € [,, 


Ira) — F=f" tae O 


hilt) dt — 28,8(9 + 8,) < A,: 
-8, 





<| f ft) de= (2-04 8, (0 £ 8,) 
6-8, 





ff 10 — (x — 0 +8,)#(0 + 8.) 


< 2A,. 
The lemma thus follows from Lemma 1. 0 


LEMMA 3. Suppose $\theta_n \in I_n$. Let $f_n$, $F_n^*$ be as in Lemma 2 with the mode of $f_n$
at $\theta_n$. Then

(3.7)    $\sup_x |\hat F_n(x) - F_n^*(x)| \le \sup_x |F_n(x) - F_n^*(x)|$.

PROOF. Recall that $\hat F_n$ is constructed in Section 2. Since $\hat F_n$ and $F_n^*$ are both
convex if $x < \theta_n$ and both concave if $x > \theta_n$, the lemma follows directly from
Marshall's lemma (1970). □


PROOF OF THEOREM 2.2. From (3.6) and (3.7),

$\sup_x |\hat F_n(x) - F(x)| \le \sup_x |\hat F_n(x) - F_n^*(x)| + \sup_x |F_n^*(x) - F(x)|$
$\qquad \le \sup_x |F_n(x) - F_n^*(x)| + o_p(n^{-1/2})$
$\qquad \le \sup_x |F_n(x) - F(x)| + o_p(n^{-1/2})$

uniformly in $\mathscr{F}^*(\delta_0, M, K)$.


The first part of the theorem follows immediately from the above fact. 

To show Ê, is a.m. relative to the family {#*(5,, M, K)}, it suffices to show 
that F, is a.m. relative to the family {&*(6), M, K )}. If we can show that F, is 
a.m. (in the sense of (2.1)) among the collection &*(M, 2) = Us, . o&*(5o, M, 2) 
for some M > 0, then this, together with the fact that lims 08 *(5o, M, 2) = 
& *(M, 2), will imply 


6.8) up tom eeu nell n™(6 ~ P) i 
.8) sup a , 
& no inf Supp e e+e, Mg Erl Sily S F)| bas; dy)} 


So, it suffices to show F, is a.m. among the collection &*(M, 2). 

To see this, we claim that $\Phi$, the standard normal d.f., is again a radial cluster
point in the family $\mathscr{F}^*(M, 2)$. Since $\phi(x) = (1/\sqrt{2\pi})e^{-x^2/2}$ satisfies (2.7) with
$\gamma_0 = \gamma = 1/\sqrt{2\pi}$, we have $\Phi \in \mathscr{F}^*(\delta_0, M, 2) \subset \mathscr{F}^*(M, 2)$ for some proper $\delta_0$ and
$M$. For any $h \in H_n$ (defined in the beginning of this section), it is easy to check
that $\phi(x; n^{-1/2}h) \in \mathscr{F}^*(\delta^*, M, 2)$ for some $\delta^* > 0$. Since $\bigcup_{n=1}^{\infty} H_n$ is dense in
$H(\Phi)$, this shows that $\Phi$ is a radial cluster point in $\mathscr{F}^*(M, 2)$, and the theorem
thus follows. □


Acknowledgment. The author is grateful to one of the associate editors for 
his critical reading and most helpful comments and suggestions (including (2.5)) 
on the original manuscript. 


REFERENCES 


BARLOW, R. E., BARTHOLOMEW, D. J., BREMNER, J. M. and BRUNK, H. D. (1972). Statistical
Inference under Order Restrictions. Wiley, New York.

CHERNOFF, H. (1964). Estimation of the mode. Ann. Inst. Statist. Math. 16 31-41.

DVORETZKY, A., KIEFER, J. and WOLFOWITZ, J. (1956). Asymptotic minimax character of the
sample distribution function and of the classical multinomial estimator. Ann. Math.
Statist. 27 642-669.

GRENANDER, U. (1965). Some direct estimates of the mode. Ann. Math. Statist. 36 131-138.

HASMINSKII, R. Z. (1979). Lower bound for the risks on nonparametric estimates of the mode. In
Contributions to Statistics, Jaroslav Hájek Memorial Volume (J. Jurečková, ed.).
Academia, Prague.

KIEFER, J. and WOLFOWITZ, J. (1976). Asymptotically minimax estimation of concave and convex
distribution functions. Z. Wahrsch. verw. Gebiete 34 73-85.

MARSHALL, A. W. (1970). Discussion of Barlow and van Zwet's papers. In Nonparametric
Techniques in Statistical Inference (M. L. Puri, ed.) 175-176. Cambridge Univ. Press.

MILLAR, P. W. (1979). Asymptotic minimax theorems for the sample distribution functions. Z.
Wahrsch. verw. Gebiete 48 233-262.

VENTER, J. (1967). On the estimation of the mode. Ann. Math. Statist. 38 1446-1455.

WANG, J. L. (1982). Asymptotically minimax estimators for distributions with increasing failure rate.
Ph.D. dissertation, Univ. California, Berkeley.


DEPARTMENT OF STATISTICS 
RUTGERS UNIVERSITY 
New BRUNSWICK, NEW JERSEY 08903 


The Annals of Statistics
1986, Vol. 14, No. 3, 1139-1151


ON ASYMPTOTICALLY EFFICIENT ESTIMATION IN 
SEMIPARAMETRIC MODELS 


By ANTON SCHICK 


State University of New York-Binghamton 


A general method for the construction of asymptotically efficient esti- 
mates in semiparametric models is presented. It improves and modifies 
Bickel’s (1982) construction of adaptive estimates and obtains asymptotically 
efficient estimates under conditions weaker than those in Bickel. 


1. Introduction. In this paper we give a general method for the construction 
of asymptotically efficient estimates in semiparametric models. More specifically, 
our estimates are regular with smallest possible asymptotic variance as discussed 
in Begun, Hall, Huang, and Wellner (1983) and are LAM-adaptive in the sense of 
Fabian and Hannan (1982) and adaptive in the sense of Begun, Hall, Huang, and 
Wellner (1983) if Stein’s (1956) necessary condition for adaptive estimation holds. 
Our construction improves and generalizes Bickel’s (1982) method of constructing 
adaptive estimates: We obtain asymptotically efficient estimates under weaker 
conditions than in Bickel and use the entire sample to construct estimates of the 
score function or the nuisance parameter and not just a small fraction as Bickel 
does. Our construction also compares favorably with a construction given by
Huang (1982) in a thesis.

We show that Bickel’s condition S* which he reasons is “heuristically neces- 
sary” for the existence of adaptive estimates in convex models and which 
motivates Bickel’s construction is not necessary for adaptive estimation in 
general. We replace it by a weaker condition and show that this condition is 
necessary for our construction. It is seen that if S* does not hold adaptive 
estimates are more difficult to construct in that a certain rate of convergence is 
required for the estimate of the nuisance parameter. 

Our paper is organized as follows. In Section 2 we present the construction of 
asymptotically efficient estimates. In Section 3 we present examples and show 
that condition S* is not necessary for the construction of adaptive estimates. 

Some notation will be introduced next. ( ) will be used to denote finite or
infinite sequences, and, in particular, points in $R^k$. In matrix calculations, points
in $R^k$ are columns.

If $(A_n, \mathscr{A}_n, P_n)$ is a probability space and $g_n$ is a measurable function on $A_n$
to $R^k$ for each $n = 1, 2, \ldots$, and if $c \in R^k$, then (i) we write $g_n \to c$ in $(P_n)$-prob.
if $P_n(\|g_n - c\| > \varepsilon) \to 0$ for every $\varepsilon > 0$ and (ii) we say $(g_n)$ is bounded in
$(P_n)$-prob. if $(F_n)$ is tight, where $F_n$ is the distribution of $g_n$ under $P_n$.


Received September 1984; revised October 1985. 
AMS 1980 subject classifications. 62G05, 62G20.
Key words and phrases. Efficient estimation, adaptation, semiparametric model, regression model. 




2. The construction of asymptotically efficient estimates. Throughout
this section we assume that $\{f_\theta, \theta \in \Theta\}$ is a family of probability densities with
respect to a sigma-finite measure $\nu$ on a measurable space $(S, \mathscr{S})$, that the index
set $\Theta = \Theta_1 \times \Theta_2$ for some open subset $\Theta_1$ of $R^p$ and some arbitrary nonempty
set $\Theta_2$, and that, for every $\theta \in \Theta$, there is a $\rho_\theta$ in $L_2^p(\nu)$ such that

(2.1)    $\| f^{1/2}(\cdot, \theta_1 + a, \theta_2) - f_\theta^{1/2} - a^T\rho_\theta \|_\nu = o(\|a\|)$,

where $\|\cdot\|_\nu$ denotes the $L_2(\nu)$-norm. We also assume that the matrix

$I_*(\theta) = 4\int \rho_\theta^*\,\rho_\theta^{*T}\,d\nu$

is nonsingular for all $\theta \in \Theta$, where $\rho_\theta^*$ is the vector whose components are the
projections of the components of $\rho_\theta$ onto the orthogonal complement of $\mathscr{T}_2(\theta)$, the
set of all $\psi$ in $L_2(\nu)$ for which there is a map $\eta$ on $(-1, 1)$ into $\Theta_2$ such that
$\eta(0) = \theta_2$ and

$\| f^{1/2}(\cdot, \theta_1 + a, \eta(b)) - f_\theta^{1/2} - a^T\rho_\theta - b\psi \|_\nu = o(\|a\| + |b|)$.

The above conditions generalize some of the concepts in Begun, Hall, Huang,
and Wellner (1983) to the case of an arbitrary nuisance parameter set $\Theta_2$. The
reader familiar with their paper will recognize that $\mathscr{T}_2(\theta)$ plays the role of their
tangent space $\{A\beta: \beta \in B\}$ and $\rho_\theta^*$ generalizes what they call the effective score
function. Observe also that Stein's (1956) necessary condition for adaptive
estimation as reformulated by Bickel (1982) and Fabian and Hannan (1982) can be
stated, in the present context, as

(S)    $\rho_\theta^* = \rho_\theta$ for all $\theta \in \Theta$.

For convenience in notation we shall often write $f(\cdot, t, v)$ instead of $f_{(t,v)}$ and
similarly for other functions $g_\theta$, $\theta \in \Theta$.

Now consider probability measures $\{P_\theta, \theta \in \Theta\}$ and $S$-valued random
variables $X_1, X_2, \ldots$, such that under each $P_\theta$, $X_1, X_2, \ldots$ are independent and
identically distributed with density $f_\theta$. Our goal is to construct an estimate
$(Z_n) = (Z_n(X_1, \ldots, X_n))$ which satisfies

(2.2)    $n^{1/2}(Z_n - \bar Z_n(\theta)) \to 0$ in $P_\theta$-prob.

for each $\theta \in \Theta$, where

$\bar Z_n(\theta) = \theta_1 + \frac{1}{n}\sum_{j=1}^{n} l_*(X_j, \theta)$

and

$l_*(\cdot, \theta) = I_*^{-1}(\theta)\, 2 f_\theta^{-1/2} \rho_\theta^*\, \chi_{[f_\theta > 0]}$.

We call an estimate $(Z_n)$ that satisfies (2.2) asymptotically linear at $\theta$ and write
$(Z_n)$ is AL($\theta$). An estimate that is AL($\theta$) for all $\theta \in \Theta$ is said to be asymptotically
linear. Our interest in asymptotically linear estimates is based on the
following results.




Suppose $(Z_n)$ is AL($\theta$); then $(Z_n)$ is regular at $\theta$, i.e., for every sequence
$(t_n, b_n)$ in $\Theta_1 \times (-1, 1)$ such that $(n^{1/2}(t_n - \theta_1, b_n))$ is bounded and every map
$\eta$ as described in the definition of $\mathscr{T}_2(\theta)$

$\mathscr{L}(n^{1/2}(Z_n - t_n) \mid P_{(t_n, \eta(b_n))}) \Rightarrow N(0, I_*^{-1}(\theta))$.

This follows from a straightforward contiguity argument. Moreover, under mild
additional assumptions $(Z_n)$ has the smallest possible asymptotic variance among
all estimates regular at $\theta$ [see Begun, Hall, Huang, and Wellner (1983), Theorem
3.1], and if the necessary condition for adaptive estimation $\rho_\theta^* = \rho_\theta$ holds, then
$(Z_n)$ is LAM-adaptive at $\theta$ in the sense of Fabian and Hannan (1982) and
adaptive at $\theta$ in the sense of Begun, Hall, Huang, and Wellner (1983).


REMARK 1. Bickel (1982) defines adaptivity at $\theta$ for an estimate $(Z_n)$ by

(i) For every sequence $(t_n)$ in $\Theta_1$ such that $(n^{1/2}(t_n - \theta_1))$ is bounded

$\mathscr{L}(n^{1/2}(Z_n - t_n) \mid P_{(t_n, \theta_2)}) \Rightarrow N(0, I^{-1}(\theta))$,

where $I(\theta) = 4\int \rho_\theta \rho_\theta^T\,d\nu$.

Condition (i) is equivalent to

(ii)    $n^{1/2}\Bigl(Z_n - \theta_1 - \frac{1}{n}\sum_{j=1}^{n} L(X_j, \theta)\Bigr) \to 0$ in $P_\theta$-prob.,

where $L(\cdot, \theta) = I^{-1}(\theta)\, 2 f_\theta^{-1/2} \rho_\theta\, \chi_{[f_\theta > 0]}$.

This follows from Theorem 6.3 in Fabian and Hannan (1982) and Theorem 6.1 in
Bickel (1982) and the note thereafter. Thus an estimate adaptive in Bickel's
sense is AL($\theta$) if and only if $\rho_\theta^* = \rho_\theta$. Bickel claims that the existence of
estimates adaptive in his sense implies the necessary condition for adaptive
estimation, which would imply that such estimates are automatically asymptotically
linear. But the proof of this claim is incorrect due to an inappropriate
reference to Hájek (1972): Bickel considers only local alternatives of the first
component of the parameter $\theta$ and not local alternatives of both components as
needed in Hájek's Theorem 4.2.


We shall now consider the construction of asymptotically linear estimates. We 

begin by introducing the following assumptions. 

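(A.1) The map $t \in \Theta_1 \to \rho^*_{(t,v)}$ is continuous for all $v \in \Theta_2$.

(A.2) $(U_n)$ is a $\Theta_1$-valued estimate such that $(n^{1/2}(U_n - \theta_1))$ is bounded in
$P_\theta$-prob. for all $\theta \in \Theta$.

(A.3) For every $n = 1, 2, \ldots$, $l_n$ is a measurable map on $S \times \Theta_1 \times S^n$ into $R^p$
such that for each $\theta \in \Theta$ and every sequence $(t_n)$ in $\Theta_1$ for which
$(n^{1/2}(t_n - \theta_1))$ is bounded

(2.3)    $n^{1/2}\int l_n(\cdot, t_n, X_1, \ldots, X_n)\, f(\cdot, t_n, \theta_2)\,d\nu \to 0$ in $P_\theta$-prob.

and

(2.4)    $\int \| l_n(\cdot, t_n, X_1, \ldots, X_n) - l_*(\cdot, t_n, \theta_2) \|^2 f(\cdot, t_n, \theta_2)\,d\nu \to 0$ in $P_\theta$-prob.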




Assumption (A.1) corresponds to Bickel's condition UR(ii). It allows us to
conclude that

(2.5)    $n^{1/2}(\bar Z_n(t_n, \theta_2) - \bar Z_n(\theta)) \to 0$ in $P_\theta$-prob.

for every $\theta \in \Theta$ and every sequence $(t_n)$ in $\Theta_1$ for which $(n^{1/2}(t_n - \theta_1))$ is
bounded.

Assumption (A.2) is Bickel's condition GR(iv) and is obviously necessary for
the existence of asymptotically linear estimates.

Assumption (A.3) generalizes Bickel's condition H. We replace his requirement
that $l_n(\cdot, \cdot, X_1, \ldots, X_n)$ is $\mathscr{H}$-valued, i.e.,

(2.6)    $\int l_n(\cdot, \theta_1, X_1, \ldots, X_n)\, f(\cdot, \theta)\,d\nu = 0$ for all $\theta \in \Theta$

by the weaker (2.3). Bickel argues that his condition S*, which suggests (2.6), is
"heuristically necessary" for adaptive estimation in convex models. Condition S*,
however, is not necessary in general, i.e., there exist nonconvex models for which
S* does not hold but adaptive estimates exist. For an example see Section 3. If S*
does not hold the set

$\mathscr{H} = \{h: h$ is a map on $S \times \Theta_1$ into $R^p$ such that
$\int h(\cdot, \theta_1)\, f(\cdot, \theta)\,d\nu = 0$ for all $\theta \in \Theta\}$

may not be large enough for Bickel's condition H to hold. In view of this (2.3)
appears to be a necessary improvement.

Next we introduce our estimate. For technical reasons we adopt Bickel's idea
of splitting the sample, but modify it to obtain better estimates of the score
function. Bickel splits the sample in two unequal parts, estimates the score
function based on the observations in the smaller subsample, and evaluates the
estimate of the score function only at observations of the larger part. We divide
the sample in two equal parts, obtain an estimate of the score function from each
part, and evaluate the estimate of the score function obtained from the first part
only with observations from the second part and vice versa. Thus our estimates
of the score function are based on half of the sample and not just on a small
proportion of the sample. Our estimate is formally defined by

(2.7)    $\hat Z_n = \bar U_n + \frac{1}{n}\Bigl[\sum_{j=1}^{k_n} \hat L_{n2}(X_j, \bar U_n) + \sum_{j=k_n+1}^{n} \hat L_{n1}(X_j, \bar U_n)\Bigr]$,

where $k_n$ is the integer part of $n/2$, $\hat L_{n1} = l_{k_n}(\cdot, \cdot, X_1, \ldots, X_{k_n})$,
$\hat L_{n2} = l_{n-k_n}(\cdot, \cdot, X_{k_n+1}, \ldots, X_n)$, and $(\bar U_n)$ is a discrete version of $(U_n)$. For a
discussion and the use of discrete estimates we refer to Fabian and Hannan (1982) and
Bickel (1982).
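The sample-splitting scheme behind (2.7) can be sketched in code. The following is only an illustration under stated assumptions, not the paper's implementation: the parameter is taken to be scalar, `fit_score` is a hypothetical user-supplied routine that returns an estimated score function from a subsample, and `discretize` uses an assumed rounding rule (nearest point of a grid with mesh $c\,n^{-1/2}$), since the text refers to Fabian and Hannan (1982) and Bickel (1982) for the discretization itself.

```python
import math
from typing import Callable, Sequence

# A fitted score maps (observation x, parameter value t) to a number.
Score = Callable[[float, float], float]


def discretize(u: float, n: int, c: float = 1.0) -> float:
    """Round u to the nearest point of the grid c * n**-0.5 * Z (assumed rule)."""
    step = c / math.sqrt(n)
    return round(u / step) * step


def split_sample_estimate(
    xs: Sequence[float],
    u_n: float,
    fit_score: Callable[[Sequence[float]], Score],
) -> float:
    """One-step estimate in the spirit of (2.7) for a scalar parameter.

    The score fitted on the first half is evaluated only at observations
    from the second half, and vice versa.
    """
    n = len(xs)
    k = n // 2                      # k_n, the integer part of n/2
    first, second = xs[:k], xs[k:]
    score_1 = fit_score(first)      # plays the role of L-hat_{n1}
    score_2 = fit_score(second)     # plays the role of L-hat_{n2}
    u_bar = discretize(u_n, n)      # discrete version of the preliminary estimate
    total = sum(score_2(x, u_bar) for x in first)
    total += sum(score_1(x, u_bar) for x in second)
    return u_bar + total / n


# Toy check: with the (known) normal location score l(x, t) = x - t the
# one-step estimate reduces to the sample mean, whatever the starting point.
xs = [0.3, 1.2, -0.7, 2.5, 0.9, 1.6]
preliminary = sorted(xs)[len(xs) // 2]
estimate = split_sample_estimate(xs, preliminary, lambda sample: (lambda x, t: x - t))
```

In the toy check the fitted score ignores its subsample, so the example only exercises the bookkeeping of the two halves; in an actual application `fit_score` would estimate the score or the nuisance parameter from the data it is given.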


THEOREM 1. If assumptions (A.1), (A.2), and (A.3) hold, then $(\hat Z_n)$ is
asymptotically linear.




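PROOF. Let $\theta \in \Theta$. We must show that $(\hat Z_n)$ is AL($\theta$). Since $(\bar U_n)$ is discrete
it suffices to show that

$t_n + \frac{1}{n}\Bigl[\sum_{j=1}^{k_n} \hat L_{n2}(X_j, t_n) + \sum_{j=k_n+1}^{n} \hat L_{n1}(X_j, t_n)\Bigr]$

is AL($\theta$) for any sequence $(t_n)$ in $\Theta_1$ such that $(n^{1/2}(t_n - \theta_1))$ is bounded. Fix
now such a sequence $(t_n)$. In view of (2.3) and (2.5) it suffices to show

$n^{-1/2}\sum_{j=1}^{k_n} \bigl(\tilde L_{n2}(X_j, t_n) - l_*(X_j, t_n, \theta_2)\bigr) \to 0$ in $P_\theta$-prob.

and

$n^{-1/2}\sum_{j=k_n+1}^{n} \bigl(\tilde L_{n1}(X_j, t_n) - l_*(X_j, t_n, \theta_2)\bigr) \to 0$ in $P_\theta$-prob.,

where $\tilde L_{ni}(\cdot, t_n) = \hat L_{ni}(\cdot, t_n) - \int \hat L_{ni}(\cdot, t_n)\, f(\cdot, t_n, \theta_2)\,d\nu$, $i = 1, 2$.
These two statements are proved exactly as is (3.7) in Bickel (1982). We omit
the details. This concludes the proof. □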


REMARK 2. Actually, more can be shown. Suppose (A.1), (A.2), and the
conditions of (A.3) except possibly (2.3) hold and

$Z_n(t) = t + \frac{1}{n}\Bigl[\sum_{j=1}^{k_n} \hat L_{n2}(X_j, t) + \sum_{j=k_n+1}^{n} \hat L_{n1}(X_j, t)\Bigr]$.

Then the following are equivalent for each $\theta \in \Theta$.

(a) For every sequence $(t_n)$ in $\Theta_1$ such that $(n^{1/2}(t_n - \theta_1))$ is bounded, $(Z_n(t_n))$
is AL($\theta$).

(b) For every sequence $(t_n)$ in $\Theta_1$ such that $(n^{1/2}(t_n - \theta_1))$ is bounded

$n^{1/2}\int l_n(\cdot, t_n, X_1, \ldots, X_n)\, f(\cdot, t_n, \theta_2)\,d\nu \to 0$ in $P_\theta$-prob.

The proof is easy. Fix $\theta \in \Theta$. Let

$R_n(t) = \frac{k_n}{n}\int \hat L_{n2}(\cdot, t)\, f(\cdot, t, \theta_2)\,d\nu + \frac{n - k_n}{n}\int \hat L_{n1}(\cdot, t)\, f(\cdot, t, \theta_2)\,d\nu$.

In the proof of Theorem 1 we have shown that $(Z_n(t_n) - R_n(t_n))$ satisfies (2.2)
for every sequence $(t_n)$ in $\Theta_1$ such that $(n^{1/2}(t_n - \theta_1))$ is bounded. This did not
require (2.3). Thus (a) is equivalent to

(c) For every sequence $(t_n)$ in $\Theta_1$ such that $(n^{1/2}(t_n - \theta_1))$ is bounded

$n^{1/2} R_n(t_n) \to 0$ in $P_\theta$-prob.

And this is easily seen to be equivalent to (b).

Obviously we do not want the asymptotic linearity of $(\hat Z_n)$ to depend on the
way we discretize, i.e., we want $(\hat Z_n)$ to be asymptotically linear for every
discretized version $(\bar U_n)$ of $(U_n)$. This is equivalent to requiring (a) for all $\theta \in \Theta$.

The above shows that, in the presence of the other assumptions, (2.3) is
necessary for $(\hat Z_n)$ to be asymptotically linear for every discretized version $(\bar U_n)$
of $(U_n)$.

We shall now discuss (2.3) in more detail. For this discussion we suppose $\Theta_2$ is
a topological space, $l_*$ is measurable and satisfies, for every $\theta \in \Theta$,

$\int \| l_*(\cdot, t, v) - l_*(\cdot, \theta) \|^2 f(\cdot, t, \theta_2)\,d\nu \to 0$

as $t \to \theta_1$ and $v \to \theta_2$, and

(C.1) $(V_n)$ is a $\Theta_2$-valued estimate which satisfies $V_n \to \theta_2$ in $P_\theta$-prob. for all
$\theta \in \Theta$ and

(C.2) For every $\theta \in \Theta$ and every sequence $(t_n)$ in $\Theta_1$ such that $(n^{1/2}(t_n - \theta_1))$ is
bounded

(2.8)    $n^{1/2} Q(t_n, V_n, \theta_2) \to 0$ in $P_\theta$-prob.,

where $Q$ is the map on $\Theta_1 \times \Theta_2 \times \Theta_2$ to $R^p$ defined by

(2.9)    $Q(t, v, w) = \int l_*(\cdot, t, v)\, f(\cdot, t, w)\,d\nu$

if the integral is well defined and 0 otherwise.

It is now easily seen that (A.3) holds with $l_n(\cdot, \cdot, X_1, \ldots, X_n) = l_*(\cdot, \cdot, V_n)$. Note
that (2.8) corresponds to (2.3).

Condition (C.2) implies a certain rate of convergence for the estimate $(V_n)$. $Q$
measures how difficult the construction of asymptotically linear estimates is by
specifying this rate. In particular, if $Q = 0$, no specific rate is necessary. Bickel's
condition S* implies $Q = 0$ and this explains why Bickel is able to construct
adaptive estimates using only a small fraction of the sample to estimate the
nuisance parameter.


REMARK 3. In a thesis Huang (1982) considers a different method of
constructing asymptotically linear estimates. His estimate is a solution of the
equations

(*)    $\int l_*(x, t, V_n)\, \hat f_n(x)\,d\nu(x) = 0$,

where $(V_n)$ is an appropriate estimate of the nuisance parameter $\theta_2$ and $(\hat f_n)$ is
an appropriate estimate of the density $f_\theta$. He proves that this estimate is
asymptotically linear if it is consistent and if strong additional regularity
conditions hold. These regularity conditions severely limit the use of his estimate and
proving consistency of his estimate may pose difficult mathematical problems.
We feel also that our estimate is easier to calculate, since a solution of (*) may
require much more extensive calculations.


3. Examples. This section serves two purposes. It illustrates the above
results and provides an example of a model for which adaptive estimates exist but 
condition S* does not hold. 

Throughout $\mathscr{V}$ denotes the set of all real valued functions $v$ on [0,1] which
are absolutely continuous with square integrable derivatives and satisfy
$\int_0^1 v(t)\,dt = 0$, and $g$ denotes a Lebesgue density which satisfies

(3.1)    $\int y\,g(y)\,dy = 0$,

(3.2)    $\int y^2 g(y)\,dy = \sigma^2 < \infty$,

(3.3)    $g$ is absolutely continuous with finite Fisher information
         $J_g = \int \frac{(g'(y))^2}{g(y)}\,dy$,
and 
(3.4) fLl- 8) - L(y))’a(y) dy = oft), 


where $L = -J_g^{-1}(g'/g)\chi_{[g > 0]}$.

Now consider the regression model

(3.5)    $Y = \theta_1 + \theta_2(T) + \varepsilon$,

where $\varepsilon$ and $T$ are independent random variables, $\varepsilon$ has density $g$, $T$ has uniform
distribution on [0,1], $\theta_1$ is an unknown real number, and $\theta_2$ is an unknown
function in $\mathscr{V}$.

This model belongs to a class of models which has been recently proposed by 
Engle, Granger, Rice, and Weiss (1986) and is of considerable practical interest. 
For generalizations and related models see also Wahba (1984). 

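For our model, $\Theta_1 = R$, $\Theta_2 = \mathscr{V}$, $S = R \times [0,1]$, $\nu$ is the Lebesgue measure on
the Borel field of $S$ and the densities $f_\theta$ are given by

$f_\theta(x) = g(x_1 - \theta_1 - \theta_2(x_2))$, $x \in S$.

From (3.3) we obtain that (2.1) holds with

$\rho_\theta(x) = -\frac{1}{2}\,\frac{g'}{g^{1/2}}\bigl(x_1 - \theta_1 - \theta_2(x_2)\bigr)$, $x \in S$.

See Hájek (1972) for details. From Theorem 9.5 in Rudin (1974) we can derive
that $\theta_1 \in R \to \rho_\theta$ is continuous in $L_2(\nu)$.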


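The tangent space $\mathscr{T}_2(\theta)$ is the $L_2(\nu)$-closure of the set of functions $\psi$ in $L_2(\nu)$
of the form

$\psi(x) = v(x_2)\rho_\theta(x)$, $x \in S$,

for some $v \in \mathscr{V}$. Thus $\rho_\theta$ is orthogonal to $\mathscr{T}_2(\theta)$, $\rho_\theta^* = \rho_\theta$, and
$l_*(x, \theta) = L(x_1 - \theta_1 - \theta_2(x_2))$, $x \in S$. Consequently, the necessary condition for adaptive
estimation holds. Also, Bickel's condition S* holds if $g$ is the standard normal
density. But Bickel's condition S* does not hold in general; e.g., if $g$ is the double
exponential density, $g(y) = \frac{1}{2}e^{-|y|}$, then $L(y) = \operatorname{sign}(y)$ and with $Q$ as given
in (2.9)

$Q(t, \theta_2 + v, \theta_2) = Q(0, v, 0)$
$\qquad = \int_0^1 \int_{-\infty}^{\infty} \operatorname{sign}(y - v(t))\,\tfrac{1}{2}e^{-|y|}\,dy\,dt$
(3.6)    $\qquad = \int_0^1 \operatorname{sign}(v(t))\,(e^{-|v(t)|} - 1)\,dt$
$\qquad = \int_0^1 \operatorname{sign}(v(t))\,(e^{-|v(t)|} - 1 + |v(t)|)\,dt$
$\qquad = O\Bigl(\int_0^1 v^2(t)\,dt\Bigr)$,

but $Q \ne 0$.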
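The chain of equalities in (3.6) can be checked numerically. The sketch below is an illustration only: the test function $v(t) = t^2 - 1/3$ is an assumption chosen because it is smooth and integrates to zero on [0,1], and the outer integral is approximated by a midpoint rule.

```python
import math


def laplace_cdf(a: float) -> float:
    """c.d.f. of the double exponential density g(y) = exp(-|y|) / 2."""
    return 0.5 * math.exp(a) if a < 0 else 1.0 - 0.5 * math.exp(-a)


def inner_integral(a: float) -> float:
    """Integral of sign(y - a) g(y) dy, i.e. P(Y > a) - P(Y < a) = 1 - 2 G(a)."""
    return 1.0 - 2.0 * laplace_cdf(a)


def closed_form(a: float) -> float:
    """sign(a) (exp(-|a|) - 1), the integrand in the third line of (3.6)."""
    if a == 0:
        return 0.0
    return math.copysign(1.0, a) * (math.exp(-abs(a)) - 1.0)


def midpoint(func, m: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    return sum(func((i + 0.5) / m) for i in range(m)) / m


def v(t: float) -> float:
    """Hypothetical test function with integral zero over [0, 1]."""
    return t * t - 1.0 / 3.0


q = midpoint(lambda t: inner_integral(v(t)))      # Q(0, v, 0)
int_v2 = midpoint(lambda t: v(t) ** 2)            # integral of v^2
```

For this choice of $v$ the value `q` is nonzero but small relative to `int_v2`, in line with the bound $e^{-a} - 1 + a \le a^2/2$ for $a \ge 0$ that gives the last line of (3.6).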


Suppose now that $(Y_1, T_1), (Y_2, T_2), \ldots$ are independent copies of $(Y, T)$. We
have already seen that (A.1) is satisfied, and it is easy to verify that (A.2) holds
with $(U_n) = (\bar Y_n) = ((1/n)\sum_{j=1}^{n} Y_j)$. Thus we are left to show (A.3). We shall
construct an estimate $(V_n)$ which satisfies

(3.7)    $\int_0^1 V_n(t)\,dt = 0$ and $E_\theta \int_0^1 (V_n(t) - \theta_2(t))^2\,dt = O(n^{-2/3})$.

This implies (A.3) since there are functions $(L_n)$ such that

(3.8)    $\int_0^1 \int (L_n(y - v_n(t)) - L(y))^2 g(y)\,dy\,dt \to 0$

and

(3.9)    $n^{1/2} \int_0^1 \int L_n(y - v_n(t))\, g(y)\,dy\,dt \to 0$

whenever $\int_0^1 v_n(t)\,dt = 0$ and $\int_0^1 v_n^2(t)\,dt = O(n^{-2/3})$. For many important examples
of $g$, such as the normal or the double exponential density, we can choose
$L_n = L$. If this choice is not possible the functions $(L_n)$ can be constructed as
follows. For a sequence (7,,) of positive numbers such that 7, -> œo and nr; —> 
œ, set 


An fanl -m x) ¥(x) dx, 
where A,, = (—1,) V L A 7, and y is the standard normal density, and define 


n= An JAIE) dy. 


It is easily verified that the sequence (L,,) satisfies 
JLA) dy = 0, 


fL) - LEl) ay > 0, 
and 
\ sup|L(y)| = O(7*!), i= 0,1,2. 


The statements (3.8) and (3.9) are now readily derived from this, e.g., 
va [gly ~ og(0))a(9) drat 


= ni? f {(L,(9— v(t) — Lay) + 24(t)LO(y))a(y) dydi 


z ofn a0 at) =a) 


if fo 0,(t) dt = 0 and fo olt) dt = O(n-*”*), 

We shall now construct the estimate $(V_n)$. For a related construction, under
slightly stronger assumptions, see Stone (1985). His results show also that better
rates of convergence are possible under additional smoothness conditions on $\theta_2$.

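Let $(a_n)$ denote a sequence of positive integers and set $b_n = a_n^{-1}$. For each
$n = 1, 2, \ldots$, partition the unit interval [0,1] in $a_n$ intervals $I_{ni}$, $i = 1, \ldots, a_n$, of
equal length $b_n$ and let $m_{ni}$ denote the midpoint of $I_{ni}$ and $\chi_{ni}$ the indicator of
$I_{ni}$. We assume that the intervals $I_{ni}$ are numbered in such a way that $m_{nj} < m_{nk}$
for $1 \le j < k \le a_n$. Next set

$U_n = \frac{1}{n}\sum_{j=1}^{n} Y_j$

and

$\bar Y_{ni} = (n b_n)^{-1}\sum_{j=1}^{n} Y_j \chi_{ni}(T_j)$, $i = 1, \ldots, a_n$,

and define $V_n$ by

(3.10)    $V_n(t) = \bar Y_{n1} - U_n$, for $0 \le t \le m_{n1}$,
          $V_n(t) = \bar Y_{ni} - U_n + \frac{t - m_{ni}}{b_n}(\bar Y_{n,i+1} - \bar Y_{ni})$, for $m_{ni} \le t < m_{n,i+1}$,
          $V_n(t) = \bar Y_{na_n} - U_n$, for $m_{na_n} \le t \le 1$.

It is easily verified that

$\int_0^1 V_n(t)\,dt = b_n \sum_{i=1}^{a_n} \bar Y_{ni} - U_n = 0$.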
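The construction (3.10) is a piecewise-linear interpolation of centered bin averages, and can be sketched as follows. This is an illustrative implementation under assumptions: the function name `make_v_n`, the deterministic toy data, and the convention that a point on a bin boundary is assigned to the bin on its right are not from the paper.

```python
import math
from typing import Callable, List, Sequence


def make_v_n(ys: Sequence[float], ts: Sequence[float], a_n: int) -> Callable[[float], float]:
    """Return the piecewise-linear estimate V_n of (3.10) built from a_n bins."""
    n = len(ys)
    b_n = 1.0 / a_n
    u_n = sum(ys) / n
    # Y-bar_{ni} = (n b_n)^{-1} * sum_j Y_j * 1{T_j in I_{ni}}
    sums: List[float] = [0.0] * a_n
    for y, t in zip(ys, ts):
        sums[min(int(t * a_n), a_n - 1)] += y
    y_bar = [s / (n * b_n) for s in sums]
    mids = [(i + 0.5) * b_n for i in range(a_n)]   # midpoints m_{ni}

    def v_n(t: float) -> float:
        if t <= mids[0]:
            return y_bar[0] - u_n
        if t >= mids[-1]:
            return y_bar[-1] - u_n
        i = min(int((t - mids[0]) / b_n), a_n - 2)
        frac = (t - mids[i]) / b_n
        return y_bar[i] - u_n + frac * (y_bar[i + 1] - y_bar[i])

    return v_n


# Toy data (hypothetical): 20 design points and a smooth signal plus a wiggle.
ts = [(j + 0.5) / 20 for j in range(20)]
ys = [1.0 + math.sin(2 * math.pi * t) + 0.1 * (-1) ** j for j, t in enumerate(ts)]
v_n = make_v_n(ys, ts, a_n=4)

# Midpoint rule on a grid aligned with the knots, so it is exact up to rounding.
m = 800
integral = sum(v_n((i + 0.5) / m) for i in range(m)) / m
```

The value of `integral` is zero up to rounding, which mirrors the identity $\int_0^1 V_n(t)\,dt = b_n\sum_i \bar Y_{ni} - U_n = 0$ stated above.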




Lemma. If the sequence (a,,) is chosen such that 
. 
31 
(1) Th 


then

$E_\theta \int_0^1 (V_n(t) - \theta_2(t))^2\,dt = O(n^{-2/3})$.


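PROOF. For $i = 1, 2, \ldots, a_n$ set

(2)    $c_{ni} = a_n \int_0^1 \chi_{ni}(u)\theta_2(u)\,du$

and note that

(3)    $E_\theta \bar Y_{ni} = \theta_1 + c_{ni}$.

Easy calculations show that for some constant $c$

(4)    $E_\theta(\bar Y_{ni} - \theta_1 - c_{ni})^2 \le \frac{c}{n b_n}$

and

(5)    $E_\theta(U_n - \theta_1)^2 \le \frac{c}{n}$.

Next note that by the Schwarz inequality for $0 \le t < u \le 1$

(6)    $(\theta_2(t) - \theta_2(u))^2 = \Bigl(\int_t^u \theta_2'(x)\,dx\Bigr)^2 \le (u - t)\int_t^u (\theta_2'(x))^2\,dx$.

Using (2), (6), and Jensen's inequality we obtain

(7)    $\int (\theta_2(t) - c_{ni})^2 \chi_{ni}(t)\,dt = \int \Bigl(a_n \int (\theta_2(t) - \theta_2(u))\chi_{ni}(u)\,du\Bigr)^2 \chi_{ni}(t)\,dt$
       $\qquad \le a_n \int\!\!\int (\theta_2(t) - \theta_2(u))^2 \chi_{ni}(u)\,du\,\chi_{ni}(t)\,dt$
       $\qquad \le b_n^2 \int (\theta_2'(x))^2 \chi_{ni}(x)\,dx$

and

(8)    $(c_{n,i+1} - c_{ni})^2 = \Bigl(a_n^2 \int\!\!\int (\theta_2(t) - \theta_2(u))\chi_{ni}(u)\chi_{n,i+1}(t)\,du\,dt\Bigr)^2$
       $\qquad \le a_n^2 \int\!\!\int (\theta_2(t) - \theta_2(u))^2 \chi_{ni}(u)\chi_{n,i+1}(t)\,du\,dt$
       $\qquad \le 2 b_n \int (\theta_2'(x))^2 (\chi_{ni}(x) + \chi_{n,i+1}(x))\,dx$.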




Combining (4) and (8) shows that for some constant C 


(9) EA = Yp)? < C(b? + bp). 
tl] 

Next define 

(10) T= È Yam- Us 

and = 

(11) V, = Š Cika 


It follows from (1) and (9) that $ 
a,—1 
E; {(V,(t) — V,(t))’ dt E(X = Vb 
(12) afí nt) A(t) s 2 a ni+l m) n 
` < C(n™’b;' + b?) = O(n?) 
and from (1), (4), and (5) that 


(13) Eg f(V,(t) — ¥,(t))” dt = O(n). 
Furthermore by (1) and (7) 
(14) SOE) — 0,(t))° dt = O(n-*”). 
Combining (12) to (14) gives the desired result. O 
It follows from the above that

$\hat Z_n = \bar U_n + \frac{1}{n}\Bigl[\sum_{j=1}^{k_n} L_n(Y_j - \bar U_n - V_{n2}(T_j)) + \sum_{j=k_n+1}^{n} L_n(Y_j - \bar U_n - V_{n1}(T_j))\Bigr]$

is asymptotically linear and adaptive, where $(\bar U_n)$ is a discrete version of
$(U_n) = ((1/n)\sum_{j=1}^{n} Y_j)$, $(L_n)$ is as described in (3.8) and (3.9), $k_n$ is the integer part
of $n/2$, and $(V_{n1})$ and $(V_{n2})$ are the versions of $(V_n)$ based on the first $k_n$ and
the second $n - k_n$ observations, respectively. In particular, we obtain that

$\hat Z_n = \bar U_n + \frac{1}{n}\Bigl[\sum_{j=1}^{k_n} \operatorname{sign}(Y_j - \bar U_n - V_{n2}(T_j)) + \sum_{j=k_n+1}^{n} \operatorname{sign}(Y_j - \bar U_n - V_{n1}(T_j))\Bigr]$

is an adaptive estimate, if $g$ is the double exponential density. This provides an
example of a model for which Bickel's condition S* does not hold (see (3.6)) but
adaptive estimates exist. Of course, Bickel's convexity condition C does not hold
for this example.
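For the double exponential case the estimate above can be sketched end to end. The code below is an illustration under stated assumptions rather than the paper's procedure: the number of bins is taken as the nearest integer to the cube root of the subsample size (an assumption consistent with the $O(n^{-2/3})$ rate), the discretization rounds to a grid of mesh $n^{-1/2}$, and the simulated model with $\theta_1 = 1.5$ and $\theta_2(t) = \sin(2\pi t)$ is hypothetical.

```python
import math
import random


def fit_v(ys, ts, a_n):
    """Piecewise-linear estimate of theta_2 as in (3.10), from one subsample."""
    n = len(ys)
    b_n = 1.0 / a_n
    u = sum(ys) / n
    sums = [0.0] * a_n
    for y, t in zip(ys, ts):
        sums[min(int(t * a_n), a_n - 1)] += y
    y_bar = [s / (n * b_n) for s in sums]
    mids = [(i + 0.5) * b_n for i in range(a_n)]

    def v(t):
        if t <= mids[0]:
            return y_bar[0] - u
        if t >= mids[-1]:
            return y_bar[-1] - u
        i = min(int((t - mids[0]) / b_n), a_n - 2)
        return y_bar[i] - u + (t - mids[i]) / b_n * (y_bar[i + 1] - y_bar[i])

    return v


def sign(x):
    return (x > 0) - (x < 0)


def adaptive_location(ys, ts):
    """Split-sample estimate of theta_1 with the score L(y) = sign(y)."""
    n = len(ys)
    k = n // 2
    step = 1.0 / math.sqrt(n)
    u_bar = round((sum(ys) / n) / step) * step       # assumed discretization
    a_1 = max(1, round(k ** (1.0 / 3.0)))            # assumed bin counts
    a_2 = max(1, round((n - k) ** (1.0 / 3.0)))
    v_1 = fit_v(ys[:k], ts[:k], a_1)                 # from the first half
    v_2 = fit_v(ys[k:], ts[k:], a_2)                 # from the second half
    total = sum(sign(y - u_bar - v_2(t)) for y, t in zip(ys[:k], ts[:k]))
    total += sum(sign(y - u_bar - v_1(t)) for y, t in zip(ys[k:], ts[k:]))
    return u_bar + total / n


# Simulated data from the hypothetical model Y = 1.5 + sin(2 pi T) + eps.
random.seed(0)
n = 2000
ts = [random.random() for _ in range(n)]
eps = [random.expovariate(1.0) * random.choice((-1.0, 1.0)) for _ in range(n)]
ys = [1.5 + math.sin(2 * math.pi * t) + e for t, e in zip(ts, eps)]
theta_1_hat = adaptive_location(ys, ts)
```

The correction term is an average of signs of residuals, so the one-step estimate moves the sample mean toward a median-type estimate, which is the efficient direction for double exponential errors.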

We conclude this section with remarks dealing with extensions of our model
(3.5).

REMARK 4. The above results are easily extended to cover the model

(3.11)    $Y = \sum_{i=1}^{p} \beta_i h_i(T) + w(T) + \varepsilon$,

where $\varepsilon$ and $T$ are as in (3.5), $\beta_1, \ldots, \beta_p$ are unknown real numbers, $h_1, \ldots, h_p$
are square integrable functions on [0,1] such that the matrix $H = [\int h_i(t)h_j(t)\,dt]$
is the identity matrix, and $w$ is an unknown function on [0,1] that has a square
integrable derivative and satisfies $\int w(t)h_i(t)\,dt = 0$, $i = 1, \ldots, p$. For this model
$\theta_1 = (\beta_1, \ldots, \beta_p)$ and $\theta_2 = w$. Note that (3.5) is the special case $p = 1$ and
$h_1 = 1$. Let $h = (h_1, \ldots, h_p)$. Easy calculations show that the necessary condition
for adaptive estimation holds and that

$l_*(x, \theta) = h(x_2)\,L(x_1 - \theta_1^T h(x_2) - \theta_2(x_2))$, $x \in S$.


Also, verify that 
2 


1 
=o) 
Ey f (WCE) = (2)? dt = O(n), 


where W, = V, - EP (AV,(oh,(t) dth, with V, = V, + 1/nX}_1Y, and V, as in 
(3.10). It “follows as above that 


T, + ‘| E A(X,)L,(¥ - T,- Wal) 


n yel 


E YA(T) - 0, 


N jml 














and 


n 
+ È A(X)L (Y, - U- WalT)) 
J=ka+t1 
is an adaptive estimate, where $(\bar U_n)$ is a discrete version of $((1/n)\sum_{j=1}^{n} Y_j h(T_j))$,
$(L_n)$ is an appropriate modification of $L$ in the spirit of (3.8) and (3.9), and
$(W_{n1})$ and $(W_{n2})$ are the versions of $(W_n)$ which are based on the first $k_n$ and
second $n - k_n$ observations, respectively.


REMARK 5. A more realistic version of (3.11) is to assume that the
density $g$ is also unknown, but satisfies (3.1) to (3.3). The nuisance parameter in this
case is $\theta_2 = (w, g)$ and the necessary condition for adaptive estimation holds if
and only if $\int h_i(t)\,dt = 0$ for all $i = 1, \ldots, p$. This model deserves further
investigation. We believe that asymptotically linear estimates can be constructed for
this more general model.


Acknowledgments. This paper is based on my doctoral dissertation. I am 
grateful to my advisor, Professor Václav Fabian, for his generous guidance and 
encouragement. I am also grateful to Professor P. J. Bickel for the Engle, 
Granger, Rice, and Weiss reference and to the Associate Editor for providing 
several references. 




REFERENCES 


BEGUN, J., HALL, W., HUANG, W. and WELLNER, J. (1983). Information and asymptotic efficiency in
parametric-nonparametric models. Ann. Statist. 11 432-452.

BICKEL, P. J. (1982). On adaptive estimation. Ann. Statist. 10 647-671.

ENGLE, R. F., GRANGER, C. W. J., RICE, J. and WEISS, A. (1986). Semiparametric estimates of the
relation between weather and electricity sales. J. Amer. Statist. Assoc. 81 310-320.

FABIAN, V. and HANNAN, J. (1982). On estimation and adaptive estimation for locally asymptotically
normal families. Z. Wahrsch. verw. Gebiete 59 459-478.

HÁJEK, J. (1972). Local asymptotic minimax and admissibility in estimation. Proc. Sixth Berkeley
Symp. Math. Statist. Prob. 1 175-194. Univ. California Press.

HUANG, W.-M. (1982). Parameter estimation when there are nuisance functions. Ph.D. dissertation,
Univ. Rochester.

RUDIN, W. (1974). Real and Complex Analysis. 2nd ed. McGraw-Hill, New York.

STEIN, C. (1956). Efficient nonparametric testing and estimation. Proc. Third Berkeley Symp. Math.
Statist. Prob. 1 187-195. Univ. California Press.

STONE, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689-705.

WAHBA, G. (1984). Partial spline models for the semiparametric estimation of functions of several
variables. Proceedings of the Seminar on Statistical Analysis of Time Series. Japan-U.S.
Joint Seminar.


DEPARTMENT OF MATHEMATICAL SCIENCES 
STATE UNIVERSITY OF NEW YORK 
BINGHAMTON, NEW YORK 13901 


The Annals of Statistics
1986, Vol. 14, No. 3, 1152-1170


ASYMPTOTIC BEHAVIOR OF THE EMPIRIC DISTRIBUTION 
OF M-ESTIMATED RESIDUALS FROM A REGRESSION 
MODEL WITH MANY PARAMETERS¹


By STEPHEN PORTNOY 
University of Illinois 
Consider a regression model $Y_i = x_i'\beta + R_i$, $i = 1, \ldots, n$, where $\{R_i\}$ are
i.i.d. with c.d.f. $F$; $x_i \in R^p$ and $\beta \in R^p$. Let $\hat\beta$ be an M-estimator defined
using kernel $\psi$, let $\hat F_n(x)$ denote the empiric distribution of the residuals,
$Y_i - x_i'\hat\beta$, and let $F_n^*$ be the empiric c.d.f. of the errors, $\{R_i\}$. Under suitable
smoothness conditions on $\psi$, $F$, and the density $F' = f$ and conditions
requiring essentially that $\{x_i\}$ behave like a random sample from some
distribution in $R^p$, it is shown that, for fixed $x$,

$\sqrt{n}\,(\hat F_n(x) - F_n^*(x) - H_n(x)) - (p/\sqrt{n})\,g(x) \to_P 0$,

where $g(x) = a f(x)\psi(x) + b f'(x)$ and $H_n(x) = (1/nd) f(x)\sum_{i=1}^{n} \psi(R_i)$ if the
design has a constant term [and $H_n(x)$ vanishes otherwise]. A tightness result
shows that if $p/\sqrt{n} \to c$, $\sqrt{n}(\hat F_n(x) - F(x))$ converges weakly to a Gaussian
process with drift given by the bias term $cg(x)$, and covariance function
strongly affected by $H_n(x)$ and different from that for the usual Brownian
bridge. In the course of the proof, an expansion for the fitted values, $x_i'\hat\beta$, is
obtained, with error $O_p(p^{11/4}\ln^2 n/n^2) = o_p(1/\sqrt{n})$ if $p^2/n$ is bounded.


1. Introduction. The use of residuals in analyzing linear (and nonlinear)
models has become extremely widespread. The basic results here concern the
asymptotic behavior of residuals from regression models when M-estimators are
used and the number of parameters is permitted to grow with the sample size. To
be precise, consider the general linear model

(1.1)    $Y_i = x_i'\beta + R_i$, $i = 1, \ldots, n$,

where $\{R_1, \ldots, R_n\}$ are i.i.d. with c.d.f. $F$, $\{x_1, \ldots, x_n\}$ are (fixed) vectors in $R^p$
and $\beta \in R^p$. Let $\psi$ be a given kernel and define the M-estimator, $\hat\beta$, to be any
solution of the vector equation

(1.2)    $0 = \sum_{i=1}^{n} x_i \psi(Y_i - x_i'\hat\beta)$.


Received October 1984; revised December 1985.
¹Research partially supported by NSF Grants MCS 83-01834 and DMS 85-03785.
AMS 1980 subject classifications. Primary 62G35; secondary 62E20, 62J05.
Key words and phrases. Regression, residuals, M-estimators, empiric c.d.f., asymptotics.

For the results here, we assume $\beta = 0$ without loss of generality. Define $\hat F_n(x)$ to
be the empiric c.d.f. of the residuals:

(1.3)    $\hat F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i - x_i'\hat\beta \le x) = \frac{1}{n}\sum_{i=1}^{n} I(R_i - x_i'\hat\beta \le x)$

(since $\beta = 0$), where $I(\cdot)$ is the indicator function of its argument. Let $F_n^*(x)$ be
the empiric c.d.f. of the errors, $\{R_i\}$.
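The objects in (1.2) and (1.3) can be illustrated with a small sketch. It makes simplifying assumptions that are not part of the paper: the kernel is $\psi(u) = u$, so the M-estimator is least squares, the design has only an intercept and one regressor ($p = 2$), and the data are hypothetical with $\beta = 0$, so that $Y_i = R_i$.

```python
from typing import Callable, List, Sequence, Tuple


def least_squares_2(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Intercept and slope solving (1.2) when psi(u) = u and x_i = (1, xs[i])."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def ecdf(values: Sequence[float]) -> Callable[[float], float]:
    """Empirical c.d.f.: the fraction of values less than or equal to x."""
    n = len(values)
    return lambda x: sum(1 for v in values if v <= x) / n


# Hypothetical data with beta = 0, so the responses equal the errors.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
errors = [0.3, -1.2, 0.8, 0.1, -0.4, 1.1]
ys = list(errors)

b0, b1 = least_squares_2(xs, ys)
residuals: List[float] = [y - b0 - b1 * x for x, y in zip(xs, ys)]

f_hat = ecdf(residuals)   # empiric c.d.f. of the residuals, as in (1.3)
f_star = ecdf(errors)     # empiric c.d.f. of the errors

# Both coordinates of the estimating equation (1.2) vanish at the solution.
eq_intercept = sum(residuals)
eq_slope = sum(x * r for x, r in zip(xs, residuals))
```

For a general kernel $\psi$ the equation (1.2) has no closed-form solution and would need an iterative solver; the least-squares case is used here only because it can be written down exactly.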

If F is normal and y(u) = u (least squares), the joint distribution of residuals 
is well known. In other cases, asymptotic computations will generally be neces- 
sary. If p is fixed as n > oo, classical methods can be used. However, in most 
applications, if n is large, models with large p will be considered. For example, in 
a regression model with five independent variables, n = 100 might be considered 
adequate for reasonable asymptotic approximation. However, in such cases 
quadratic models are often considered, providing a model with $p = 21$ parameters;
and, thus, $p^2/n$ is moderately large. The basic result presented here depends
on whether or not the design has a constant term; that is, the first coordinate of
$x_i$ satisfies

(1.4)    $x_{i1} = 1$, for $i = 1, 2, \ldots, n$.
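The count $p = 21$ in the example above can be reproduced directly: a full quadratic model in $k$ regressors has one constant, $k$ linear terms, $k$ squares and $k(k-1)/2$ cross products. The short sketch below (the function name is hypothetical) does the arithmetic for $k = 5$ and $n = 100$.

```python
from math import comb


def quadratic_model_size(k: int) -> int:
    """Number of coefficients in a full quadratic model in k regressors.

    1 constant + k linear + k squared + k(k-1)/2 cross terms = C(k + 2, 2).
    """
    return comb(k + 2, 2)


p = quadratic_model_size(5)   # 1 + 5 + 5 + 10 coefficients
n = 100
ratio = p ** 2 / n            # the quantity p^2 / n for this example
```

Here `p` equals 21 and `ratio` equals 4.41, which is far from the regime in which $p^2/n$ is negligible.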


The result (Theorem 3.1) is the following: let $f(x)$ be the density of $R$,
$d = E\psi'(R)$ and $\sigma^2 = \operatorname{Var}\psi(R)$. Then


in (B(x) - B(x) - H,(x)) - als) ai 
where 


ale)=sgal (x) + iaa) 
and 


Hye) = Ha) ÈR) 


if (1.4) holds and $H_n(x)$ vanishes otherwise. The $H_n(x)$ term arises from the
estimation of the coefficient of the constant term. When $p$ is fixed, it was
considered in a regression setting by Koul (1969) and Pierce and Kopecky (1979),
and in more general situations by Burke, Csörgő, Csörgő, and Révész (1979),
Loynes (1980), and Shorack (1985), where its strong effect on the asymptotic
distribution is discussed. If $p$ is fixed and $f(x)$ is known (e.g., in testing a simple
null hypothesis), H,(x) can beappropriately estimated since dB, = (1/n)iy(R,) + 

1/ vn). mar by adjusting for H,(x) it is possible to construct a process 
WCE (Ê oe = B), which converges to the usual transformed Brownian bridge. It 
appears that this adjustment may be possible even if p — œ. If f(x) cannot be 
assumed, it too must be estimated [to order o(1/ Yn )], and this is a very difficult 
problem, particularly if p — oo. It should be noted that it is relatively easy to 
adjust for the bias term g(x) since all that is required is consistent estimators of 
f(x) and f(x) (which can be obtained under the conditions used here). 


1154 8. PORTNOY 


These results depend upon the asymptotic behavior of the M-estimators, $\hat\beta$,
which has been considered by Huber (1973, 1981), Yohai and Maronna (1979), and
more recently by Portnoy (1984, 1985a). The first two references use more or less
classical results, but require a stronger condition than $p^2/n \to 0$. The last
references show that in the regression case (where $\{x_i\}$ “act” like a sample from
some distribution in $R^p$), $\|\hat\beta\|^2 = O_P(p/n)$, and $\max_i |x_i'\hat\beta| \to_P 0$ if $p$ is sufficiently
small compared to $n$. The author (1985a) shows that $p^{3/2} \ln n / n \to 0$ is
sufficient but conjectures that $p^{1+\epsilon}/n \to 0$ should work. In fact Theorem 2.1 here
extends the earlier result slightly, obtaining a higher-order expansion of $x_i'\hat\beta$ (with
four additional terms) with error of order $(p^{7/8}(\ln n)/n)^2$. This result clearly
indicates the difficulty of using such expansions to try to verify the conjecture.

Consistency results for $\hat F_n$ were considered by Freedman (1981) and Bickel and
Freedman (1983) in the case of least-squares estimators, and by Shorack (1982) in
the case of more general M-estimators. These results were presented in the
context of showing that the Bootstrap method based on residuals is consistent.
Bickel and Freedman show that if $p/n \to 0$ then the Mallows distance between
$\hat F_n$ and $F$ tends to zero (in the “least-squares” case). They use this result to prove
consistency of the Bootstrap distribution of a fixed contrast if $p/n \to 0$ and of
the Bootstrap distribution of $\hat\beta^*$ (in $R^p$) if $p^2/n \to 0$. Shorack (1982) shows
consistency of the Bootstrap distribution of a contrast in the M-estimator case if
$p^2/n \to 0$.


2. An expansion for $x_i'\hat\beta$. The main result, Theorem 3.1, requires an expansion
of $x_i'\hat\beta$ in terms of sums of functions of $\{R_i\}$ with an error $o_P(1/\sqrt{n})$ if
$p^2/n = O(1)$. Theorem 3.1 of Portnoy (1985a) presents such an expansion with
error terms involving x’B and which are shown to be O,( p?/*(An n)°/4/n) (which 
clearly is not sufficient). Theorem 2.1 provides an adequate expansion with 
appropriate error terms.

Since results from Portnoy (1985a) will be used, some of the conditions of that
paper will be required and some further notation is needed. The conditions in
Portnoy (1985a) relating $p$ and $n$ are weaker than the condition $p^2/n = O(1)$
used here. The conditions on $\psi$ include symmetry conditions (also required on the
distribution of $R$) and three bounded continuous derivatives; a fourth bounded
derivative is needed here. As noted earlier, conditions on $\{x_i\}$ are somewhat
artificial since they are designed to hold only in typical regression cases where
$\{x_i\}$ can be considered as a sample from some distribution in $R^p$. They are
generally stated in terms of the vectors


$y_i = (X'X)^{-1/2} x_i$,  $i = 1, \ldots, n$,


and include equations (2.24) and (2.31) here. It should be noted that if (1.4) holds
then by subtracting the column mean $\bar x_j$ from the $j$th column we can construct
an equivalent design with $y_i'y_l = (1/n) + z_i'z_l$, where $\{z_i\}$ can be expected to
satisfy the conditions stated for $y_i$ (as defined originally). Using this fact, it is not
difficult to show that the conditions (2.31) and those of Portnoy (1985a) hold if
(1.4) holds (and the conditions hold for $\{z_i\}$). Only condition (2.24) must be




treated differently depending on whether or not (1.4) holds. In Portnoy (1985b),
the conditions are shown to hold in probability if the distribution of $\{x_i\}$ is a
scale mixture of a multivariate normal, although it should be possible to generalize
this distribution.

THEOREM 2.1. Assume the conditions for Theorem 3.1 of Portnoy (1985a).
Assume further that $\psi$ has a uniformly bounded fourth derivative, and that
(2.1)  $\limsup\,(p^2/n) < B < \infty$,  $\liminf\,(p^2/n) > B_1 > 0$.

Then, uniformly in $i = 1, \ldots, n$,
(2.2)  $(x_i'\hat\beta) = A_i + B_i + C_i + D_i + E_i + o_P(1/\sqrt{n})$,
where, for some constants $c_1, c_2, c_3$ and $d = E\psi'(R)$,


A, = SLs WR, 


B, ZE L (9m, (94%) (Ri, (Y(R) - d), 
cod, ERINI IILI Y(R) V (R) Y (Rp) 


I 


on D, = L LL LH, AI mI ILIV (Ra) (Ru) 
x(v(R,) = d)y(R,,), 
E= aE L L LIAI Yn AI Ip AHIL” (Re) 


x¥(Ri,)¥(R,)¥(R,,). 


REMARK. It is possible to prove Theorem 2.1 without (2.1). In fact, the
lower-bound condition in (2.1) is only used following (2.9) to obtain the appropriate
order for term $B_i$. If $p^2/n \to 0$, then $B_i$ must be included in additional
error terms. Nonetheless, it can be shown that these error terms are of sufficiently
small order; and the lower bound is not needed at all. Furthermore, the upper 
bound can be replaced by P!'/4In?n/n? — 0 [see (2.14)]. 


Proof. The results of Portnoy (1985a) are first extended to show that
$x_i'\hat\beta = A_i + O_P(p^2/n^{3/2-\epsilon})$ [see (2.17)]. This requires treating each of the terms in
(2.4), generally by computing moments and using forms of Chebyshev’s inequality.
However, one term, $(y_i'V)$, requires a more accurate expansion for $(x_i'\hat\beta)$;
and thus (2.17) must be inserted in a further expansion of $y_i'V$ to obtain (2.2).
The remainder of this section presents some details of this argument and also
gives two technical lemmas. The casual reader may want to proceed directly to
Section 3.

(i) From Portnoy (1985a), Equation (3.16), 


(x/B) = (yÊ) = yW + yU + cyi V + ca yiSW + cyyx/SU + {SV 


(2.4) ae 
+c,9,Se, + ys (W+U+V+e,), 
km2 




where using results from Section 3 of Portnoy (1985a) (uniformly in i = 1,..., n), 


pinn\i? j 
ywan O ET], WIP Op), 





; p? lèn 
yU=B, UIP = a, | 


IV = EEK Y(R, JY” (Rt), for some {Rt}, 
7/2 Jn 
(2.5) Ivi? = oa) 


S= È Lyns Int (Ra) (R), 


pnn)” 





sup{u'Su: ||u|| = 1} = o. 


p**(in n)” 
Also fror Section 3 of Portnoy (1985a), 
pan n 
CO malho E), E (Al Op), 


sm 





n pinn\? . 
(2.7) È (x,)¥(R) = 0,2") uniformly in i = 1,2,...,n. 
i=1 


Lastly, the following conditions will also be used here: 





(2.8) (xay = o|? 


and 


| uniformly in z # l 


Iyl? = of =) anonn: 
Consider y/U = B,: EB?* is a 4k-fold sum of terms of the form 
(WI, )-- (HH, JO Yı )-: “(yf yi ) 


M “ogy "2h 


xE|y(R, )(v(R,,) —4)--- ¥(R m (Y(R; N a d)|. 


Since Ey(R) = E(y(R) — d) = E4(RX 4R) — d) = 0, pairs of 1, and pairs of 
l, subscripts must be equal. If J, +, then (using 2.8), the contribution to EB?* 
is less than Bn?*(p In n/n2)?*_ If some l, = l,, (plnn)/7/n is replaced by 
p/n, but there is one less sum—thus reducing the contribution by a factor of n 




and making it smaller. Hence, EB?* = O( p Inn/n)**, and, given e > 0, 





B plan 
pP | n ig B(in n)* 
ABI -5 z» for some i = 1,. sn} sn = er 

ia) 
for k such that 2ke > 1. Therefore 
(2.9) B,= o- md uniformly in i= 1,..., n 
From (2.1), (2.9) implies B; = 0,( p?/n*/2-*), 

Now consider y/V: using (2.6), (2.7), and (2.8), 
2 
IVs Lixin lxh) |e” (RE) 2 )¥(R,)| 
4 








(2.10) 3 
ol 252)" (== o(a] 
= P = SS |. 
p n? a P\ n32 
Now define 


(2.11) C, = ySW = 2 Ss ERIA HOLI Y(R) V (R, v(R,,). 


Following the argument leading to (2.9), the main contribution to EC?* arises 
when pairs of subscripts are equal. However, for each set of factors of the form 
(KIYAI YLA i,q)» at Most one pair of subscripts can be equal (without 
reducing the number of sums). Hence, 


2 2k 
EC? = O| n3* pinn\? (ey =O Pp lnn 
1 n? n n3/2 


Therefore [as in (2.9)], for any e > 0, 








ex gae 


Now define D, = y/SU. Using the bounds in (2.5), 

















pinn pinn p*(Inn)*”” 
(2.18) D, = a, | =O or . 
Similarly [using (2.1) also], 
Inn pílin 
ySV = a oe 
(2.14) 


P 2 3/2 


3 5/4 5 2 
(2.15) y/Se, = oj apr in |- o|] 


ofe) =f 




Lastly using the bound on ||S|| and the geometric series, the final summation in 
(2.4), say e*, satisfies 


pipan 1 p’lnn 
(2.16) e* = 0, H ee yp - 0 Or x 


n Inn 
1 P 








n 


Therefore from (2.4) and (2.9) through (2.16), for e > 0, 


p? 
n3/2- a} 

(ii) Now reconsider y/V (2.5) and continue the Taylor series expansion of y: 
(2.18) yVv=) LX (9) %,)¥( Bi) [0 (R,) + + (xi WOR, )] (xi BY” 


= T+ T,, 


where T, uses the ”(R,,) term and T, uses the (x; #)y(R,) term. Inserting 
(2.17) in T, and using (2.8), (2.7), (2.5) [for A, in (2.17)] and (2.6), the sum over 
l, + l, # i in T, contributes 


yplnn [pinn plon p? 
BO on Vow WN a nial? 


k- ne | 


n? 


(2.17) (x/B) =A, + O, 








(2.19) 


=0 


P 


[using (2.1)]. It is easy to see that the sum for l, = 1, or l = i contributes a 
smaller-order term. Squaring (2.17) [and using (2.5)] and inserting in T, yields 


= E Eora,” (Rs) A; tol; =| 
(2.20) =E,+ o(a Er " 3 E (z=) 


n n 2-e 








p’? 
=E,+ 0,| Fez) for some e*, 


where $E_i$ is defined in (2.3).
Lastly reconsider the error sum which can be rewritten 


(2.21) y»S*W+yS*(U+V+e,)+ x yS*(W+U+V+e,). 
k=3 


Computing E(y/S*W)** using the argument for EC?*, it is possible (though 




tedious) to show that E(y/S?W)?* = O( p(n n)?/në)*. Hence [using (2.1)] 
p? pe 
yS’ W = o. aa = o. T] 
uniformly in z. Using (2.5), 


ade 7 P pln plinn p⁄lInn plon 
yS (U + V+ e) a (2 . T: A eaarow 








2 


pln? n 
n ? 


[oa] 
E y/S*(W+U+Vt+e,) 
k=3 








_ o (Ban) 
P n? . 
Therefore, the error sum is O,( p°/*/n?~*). Therefore, using (2.4) and combining 
(2.19), (2.20) (for »/V), (2.14), (2.15), and the error sum, 


p'/4(In n)? r Ta] 


n? 2~e 


(=i) =A, + B, + C, + D, + E+ O; a 


from which the result (2.2) follows. O 


To prove the main result, Theorem 3.1, certain results concerning random
variables related to $A_i, \ldots, E_i$ in (2.2) are needed. In particular, define for each
$i = 1, \ldots, n$, and for $l \ne i$,


(2.22)  $\tilde A_i = \frac{1}{d} \sum_{j \ne i} (y_i'y_j)\,\psi(R_j)$,  $\tilde A_{il} = \frac{1}{d} \sum_{j \ne i, l} (y_i'y_j)\,\psi(R_j)$

and define $\tilde B_i, \tilde B_{il}, \ldots, \tilde E_i, \tilde E_{il}$ from (2.3) analogously. That is, a single subscript,
$i$, involves $y_i'$ and all sums avoid index $i$; a double subscript, $il$, involves $y_i'$ and
all sums avoid both indices $i$ and $l$.

The results below will be stated using an error term slightly stronger than
$o_P$. Given a sequence $\{\gamma_n\}$ such that $\gamma_n \to +\infty$, define $\Delta(\gamma_n)$ to be a random
variable satisfying
(2.23)  $E(\Delta^2(\gamma_n)) = o(1/\gamma_n)$.


Note: In Lemma 2.3, $\Delta$ will also be a uniformly bounded function on $R^2$ and the
argument $\gamma_n$ will become a subscript, viz., $\Delta_n(u, v)$.




LEMMA 2.2. Assume the conditions of Theorem 2.1 and assume further that 


either 
plny” 
Ex = of l 
dtr n 
or, if the design has a constant term [(1.4) holds], y/y, = (1/n) + z'z, with {z,} 
satisfying the conditions for { y,} [including (2.24)]. 


(i) Then, with d = Ey(R) and o? = Var ¥(R), 





(2.24) 





=o(1), if (1.4) holds, 


(2.25) SAG = 
z| ÈA = o(1), otherwise, 
l Š p i 
(2.26) E maA = ae = 0(1). 





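(ii) Let $X$ denote $B$, $C$, $D$, or $E$ and let $\tilde X$ denote $\tilde B$, $\tilde C$, $\tilde D$, or $\tilde E$ as modified by
(2.22). Then


(2.27)  $E\Bigl(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \tilde X_i\Bigr)^2 = o(1)$  and  $X_i - \tilde X_i = \Delta(n)$.

Note also that $A_i = \tilde A_i + (1/d)\|y_i\|^2\psi(R_i)$.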
Gii) Let V,= A, + B,+ C, + D, + E, with V, and V,, defined analogously to 
(2.22). Then 


(2.28) V=Alynjp), a= Alap), and V= A+ A(n). 


REMARK. Note that condition (2.24) is similar to conditions in Portnoy 
(1985a); particularly, condition X4. It is easy to see that (2.24) will hold in 
probability using the arguments of Portnoy (1985b). Also note that (2.24) is the 
only condition affected by the presence of a constant term [the conditions in 
Portnoy (1985a) hold with or without a constant term in the design]. 


Proof. The proof involves straightforward but very tedious computations.
Since the calculations are all rather similar, only the case for $D_i$ (one of the more
complicated cases) will be sketched. So, to obtain the first part of (2.27), consider
$E\bigl(\sum_i \tilde D_i/\sqrt{n}\bigr)^2$. If (2.24) holds (i.e., the design lacks a constant), this involves a
10-fold sum over $i_1, i_2$ and eight subscripts $l_j$. As in the proof of Theorem 2.1,
these subscripts must be equal at least in pairs. Thus, using the fact that $l_j \ne i_1$
or $i_2$ and $|y_i'y_l| \le \|y_i\|\,\|y_l\| \le Bp/n$, and using (2.24),


ne 2 B 
Bl vd < Ee E/E EONO) 
n =l n h-u 'h & 


1 pin e 7In 
= (7? att = oP] = o(1). 


n n n 


6 
P 
nê 








(2.29) 










If the design has a constant term, each factor $(y_i'y_l)$ can be written $(1/n) +
(z_i'z_l)$, and the product can be expanded to yield


D=o EEEE Z + alleen) + (eizu) + (ziz) + (eiz) 
+ yl (eta, zien) + + leizetan) 


+ E [fete(iz eizu) +++ + iz Neie] 


Xy”(Ra) (Rp )(W(R,) - d)4(R,). 
Consider E((1/ Yn ED The 1/n‘ term contributes 


45 — y X E(8-fold sum) = Zol’), 
hi h 
since subscripts in the 8-fold sum must be equal in pairs. Similarly, since 
|z(2, < liz: lizi] = OC p/n), the 1/n? term contributes (1/n°)nO( p?/n?)O(n*) = 
O(p*/n*); and the 1/n? term contributes (1/n‘)nO(p1/n‘)O(n‘) = 
O( p*/n*) — 0. For the 1/n term, first consider the last term [without (z/z, )]. 
In each term in the expectation of the square of the sum, at least two of the six 
factors must have unequal subscripts so that (z/z,)’ = O( p In n/n’). Thus, the 


contribution is 
1 Inn a Sinn 
mo 2 Jo (ZJ = of 2 | > 0. 


Lastly, for the term involving $(z_i'z_{l_1})$, either $i = l_1$ (which eliminates the sum over
$i$ and gives a smaller contribution), or $i \ne l_1$ (which allows the above argument to
apply); thus proving the first part of (2.27) for $D_i$.

Equation (2.24) is not needed to obtain the second part of (2.27). Note that
$D_i - \tilde D_i$ involves sums over no more than three subscripts (since at least one
subscript $l_1$, $l_2$, $l_3$, or $l_4$ must equal $i$). Thus, $E(D_i - \tilde D_i)^2$ is a six-fold sum of
products of eight factors of the form $(y_j'y_l)$. From definition (2.3), at least two
pairs of subscripts $(l_j, l_k)$ must be unequal (in each term). Hence $E(D_i - \tilde D_i)^2$ can
be bounded by
lnn pê Tinn 1 
Bn?? E - (257) = of =|. 
n? n n n 








The remaining results in Lemma 2.2 follow using similar expectation calculations. □


LEMMA 2.3. Assume the hypotheses of Lemma 2.2, and define (for $i \ne l$)
1 ; 
Gulu) = 5{(oi¥(w) + (xe A(v(w) - d) 


+4(u) E (W7 )VR,) - €)} 


L+ l 


(2.30) 




and similarly for G,,(u). Assume 


(2.31) LX (HNI = 0 uniformly in i, l. 


itt, l 
Then, with A(n) defined in (2.23) and A,(u) uniformly bounded in u and of 
order M(y,); 


[2e n)?” 


~ 


(2.32) Ü = V+ G,(R,) + A(n) 

and 

(2.33) G,,(u) = A (uv). 

Furthermore, 

@) FEL(FE + Va) [E Gulu) au} = 000 


and similarly when i and l are interchanged. 


Proof. First note that $G_{il}(R_l) = \tilde A_i - \tilde A_{il} + \tilde B_i - \tilde B_{il}$ exactly. Consider


t ue 


Č, — Cy = coy) V (R) LY (HMH Ip ¥(B,)¥(R,,) 


hy law, t 


+e(R,) LÈ (KINH IHY V (R Y(R) 


l leit 


+eo¥(R,) LY (3I) HNH IV (Ry Y(R, )- 


lLslg*t, l 


As before, using (2.31), 


E (first term above)” < B( yy) LY (AWANA MI? EA 


hlr, l 
f 2 f 2 F t + 2 
+Y) (IIn) + (91%, (rin, C94, ny) } 
lo“ n 


2 2 p? 3 
< BOP| E (imo) + ofa] 


1*6, 








n? n? nt 


1 {p*lntn 1\. 
na 50| 3 | =o(=). 


n n 


tes p(Inn)° a | 





The terms $\tilde D_i - \tilde D_{il}$ and $\tilde E_i - \tilde E_{il}$ can be similarly bounded (with even smaller
bounds); and, hence, (2.32) holds. Similar (even easier) computations yield (2.33).




Lastly, for (2.34) note that [by Lemma 2.2 and (2.33)], 
W(x + Vu) f° Gulu) f(u) du = (F(x) + Vuil) + AGE (2) 


x{ f Cotu) du + T40,(2)4(2) 


4 G A 
5 (Oula) ia))} + a(n). 
The computations are quite tedious, but expectations of each of the terms above 


can be computed (as in the earlier proofs), and it can be shown that 1/n times 
the double sum of each term tends to zero. O 


i 2 
+ 7Ai, 


3. The basic results. 


THEOREM 3.1. Assume Theorem 2.1 holds and assume the conditions for
Lemmas 2.2, 2.3, and 3.3. Assume that $p/\sqrt{n} \to c$. Then for each fixed $x$, the
empirical distribution of residuals [see (1.3)] satisfies


(3.1) vm (f(x) - PME) - yl) - S SH) + Shox) +, 0. 
where 


(3.2)  $F_n^*(x) = \frac{1}{n} \sum_{i=1}^{n} I(R_i \le x)$,  $\sigma^2 = E\psi^2(R)$,  $d = E\psi'(R)$,


and
$H_n(x) = \frac{f(x)}{nd} \sum_{i=1}^{n} \psi(R_i)$,


if (1.4) holds, and vanishes otherwise.


Proof. First, as in Lemma 2.2, define
(3.3)  $V_i = A_i + B_i + C_i + D_i + E_i$


and define $\tilde V_i$ (which omits functions of $R_i$) and $\tilde V_{il}$ and $\tilde V_{li}$ analogously as in
(2.22). Let $\delta_n = o(1/\sqrt{n})$. By Theorem 2.1 and Lemma 2.2, with probability
tending to one,


(3.4)  $I(R_i - x_i'\hat\beta \le x) \le I(R_i \le V_i + x + \tfrac{1}{2}\delta_n) \le I_i$,
where

(3.5)  $I_i = I\bigl(R_i - \tfrac{1}{d}\|y_i\|^2\psi(R_i) \le x + \tilde V_i + \delta_n\bigr)$.


Since the reverse inequality holds with $\delta_n$ replaced by $-\delta_n$, it suffices to consider
$\hat F_n$ defined using $I_i$ instead of $I(R_i - x_i'\hat\beta \le x)$.




The remainder of the proof is concerned with establishing the result


(3.6)  $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \bigl(I_i - I(R_i \le x) - U_i\bigr) \to_P 0$
where
(3.7) U, = V f(x) + 1Å?f (x) + POTO) 


This result will be established by computing the second moment of (3.6). Notice
that Lemma 2.2 [(2.25) and (2.27)] shows that f(x)LV,/ Vn >p H (x) (if the 
design has a constant) and L(A? — (o°p/d n ))/ Vn >, 0 inifoonly in i= 
1,..., n. Hence, (3.6) immediately yields the result (3.1) (since Elly? = p). 
Now write 
Pl 2 


FÈ (1,- I(R, < x) - U) 


i=] 


(3.8) = 2 3 E(I,-1(R,<x)-U,)° 


po ~ SSB, - I(R, < x) ~ Ü) - (R, < x) - Ù). 
rm] 


Consider the second double sum term. By Lemma 2.3, note that with G,, 
defined in (2.30), 


(39) =k SIC, )- G(R) + A(n) <x + Vit ô, 


Now let Z, denote = o-field generated by all R, except R, and R,. Expanding 
the product in the double sum in (3.8) and conditioning on S; yields nine terms, 
the first of which can be computed using Lemma 3.3 [see (3.21)] as follows (with 
Zy= x + V+ 6,): 


BILIS] = P(Za)F (Za) + lil Zu)fZa) (Zn) 


+ 2 ae f(Z,)F(Z,1) 


ene) {Ixl tK (Z F(Z) 


HIY K (Za) E(Za) + 2x? oll? o(Z.)Ko(Z,)} 
+f(Za) f” Galo) ilo) do + f(Zu) f” Gulu) {(u) du + Aln), 


where (using notation from Lemma 3.3) 


Kw) = S (Ralu) + Bi(u)) f(a) du, 
K,(w) = [ia (u)f(u) du = y(w) j(w). 


ore 


(3.11) 




Expanding f, and 4 about x + v, and about x [using the fact that V, is of form 
A(yn/p ) by Lemma 2.2], 


1 m 
E([II 49a] = F(Z.) F(Z) + Glade + V,)o(x + Vi)F(Z,) 


1 ~ 
(3.12) + glate + Viola + Vi) F(Zu) 


+- { (Ila + lyt) Kul) F(x) 
qa { (lla! + Un) Ky 


+ 2lj NIN? E) yx} + A(n), 
where G,, and G,, terms have been omitted since their double sum tends to zero 
by Lemma 2.3. Similarly, the other terms of the following forms can be computed 
[using Lemma 3.3, (3.22)]: 


E[II(R, < ESIA = F(x + V,+ 8,)F(x) + _ Iy K (2) F(x) 
(3.13) 


+ SHouhe(e + Va) ie + Vy) R(2) + A(n), 
EONS) = f? JEH a+ Gaoi) 
(3.14) +Z (imo) Ay} "(u v) dudo + A(n) 


p 5 1 = = z 
= OyF(x + Vy + 8n) + Gy Daya + Vi) E(x + Va) + ACn), 


(3.15) E[ŪI(R, < x)|%,] = O,F(x) + A(n), 
where, again, G, and related terms are omitted (using Lemma 2.3). 
With some tedious computation, terms of the form (3.12) through (3.15) can be 


summed to obtain (ignoring terms whose double sum tends to zero when divided 
by n) 


E[(1,- I(R, < x) - U)(1,- I(R, < x) - G)|%,| 
= (F(x + V+ ôn) ~ F(x) - U,,)( F(x + Ün + ôn) — F(x) aa U;,) 


(336) 7 : nle(x + Vaile + Ca (F(x + V+ 8) — F(x) - n) 


+ d PARKIE t Vi.) f(x + Üp) {F(x + Vi + ôn) = F(x) = U1} 
1 
+ Sa IIP Hilf ECE) + A(n). 
Now expanding F(x + V,,+ 6,) using Lemma 2.2 [see (2.29)], 
= z 1 
F(x + Vit 8,) — F(x) ~ u= = Filyll?#(=)¥(z) + Ala) 




(and similarly for i and / reversed). Thus, the first (product) term in (3.16) 
contributes 


1 
pa lll? NaF?) (E) + A(n). 


Similarly, the second and third terms in (3.16) each contribute the negative of 
this term; and these exactly cancel the last term in (3.16). Therefore, 


LELE, - UR, <x) - 04-1, < 2) ~ Ü) 


towel 


1 
= z LL EA(n) 


tm] 
1 
= of =ant| — 0. 
Lastly consider the first square term in (3.8). Using Lemma 2.2 and computation
similar to the above, it can be shown that $E(I_i - I(R_i \le x) - U_i)^2 \to 0$


uniformly in $i$. Hence, both terms in (3.8) tend to zero, and the proof is complete. □


COROLLARY 3.2. Following Theorem 3.1, define H,(x) as in (3.2) and 


(3.17) as) = E|) + F), 


If Theorem 3.1 holds and $\{x_1, \ldots, x_k\}$ are fixed, the joint distribution of the
random variables $\{\sqrt{n}\,(\hat F_n(x_j) - F(x_j) - H_n(x_j)) - g(x_j)\}_{j=1}^{k}$ is asymptotically the same as that
of $\{\sqrt{n}\,(F_n^*(x_j) - F(x_j))\}_{j=1}^{k}$. If Theorem 2.1 holds, the processes $\{\sqrt{n}\,(\hat F_n(x) -
F(x) - H_n(x)) - g(x)\}_{n=1}^{\infty}$ are tight; and, hence, they converge weakly to the
transformed Brownian bridge process to which $\{\sqrt{n}\,(F_n^*(x) - F(x))\}_{n=1}^{\infty}$ converges,
if $p^2/n \to c$.


Proof. The joint distribution result is an immediate consequence of Theorem 3.1.
So it remains to obtain the tightness result. By Theorem 2.1, there is
$\epsilon_n \to 0$ such that with probability tending to one,


$x_i'\beta - \epsilon_n \le x_i'\hat\beta \le x_i'\beta + \epsilon_n$  uniformly in $i = 1, \ldots, n$.
Thus, with probability tending to one,


E ot x x 
ar LAR, -xi <x+6)—WR,- x$ <x)} 





sup 
x 





se È (IR, < 2+8 + en) = ICR, < 2 en) 


tml 


(3.18) < }sup 
x 








1 n 

T IR, <sx+8-e,) -IR <sxte, 
Ta L {K ) )} 
Therefore, tightness follows from tightness of the usual empiric c.d.f. [see, for
example, Billingsley (1968), Section 15], uniform continuity of $g(x)$, and tightness
of $\{\sqrt{n}\,H_n(x)\}$ [which follows from uniform continuity of $f(x)$]. □





+t sup 
x 







REMARK. (1) The corollary implies that standard goodness-of-fit tests based
on $\hat F_n$ should work if $p^2/n \to 0$ and the design lacks a constant, but not
otherwise, unless $g(x)$ vanishes. If $\psi(x) = -f'(x)/f(x)$, then $g(x)$ will indeed
vanish. However, $g(x)$ will generally not vanish.

(2) Whenever the design has a constant, the asymptotic distribution of
$\sqrt{n}\,(\hat F_n(x) - F(x))$ will be affected [as in Pierce and Kopecky (1979)]. Without
adjustment, these processes will converge weakly to the limiting process for
$\sqrt{n}\,(F_n^*(x) - F(x) + H_n(x)) + g(x)$, which is a Gaussian process with mean $g(x)$
and covariance function (for $x_1 \le x_2$)


$\gamma(x_1, x_2) = \mathrm{Cov}\{I(R_1 \le x_1) + f(x_1)\psi(R_1)/d,\; I(R_1 \le x_2) + f(x_2)\psi(R_1)/d\}$

$\quad = F(x_1 \wedge x_2) - F(x_1)F(x_2) + \frac{1}{d} f(x_1) \int_{-\infty}^{x_2} \psi(u) f(u)\,du$

$\qquad + \frac{1}{d} f(x_2) \int_{-\infty}^{x_1} \psi(u) f(u)\,du + \frac{\sigma^2}{d^2} f(x_1) f(x_2)$.
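The covariance in Remark (2) can be evaluated in a familiar special case. The sketch below assumes the covariance is that of $I(R_1 \le x) + f(x)\psi(R_1)/d$ at two arguments, expanded term by term, and specializes to least squares with standard normal errors ($\psi(u) = u$, $d = 1$, $\sigma^2 = 1$), where $\int_{-\infty}^{x} u\,\phi(u)\,du = -\phi(x)$. The function names are hypothetical.

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def gamma_ls_normal(x1, x2):
    """Covariance for psi(u) = u, d = 1, sigma^2 = 1, F = standard normal (assumed case)."""
    return (Phi(min(x1, x2)) - Phi(x1) * Phi(x2)
            + phi(x1) * (-phi(x2))     # (1/d) f(x1) * integral of psi*f up to x2
            + phi(x2) * (-phi(x1))     # (1/d) f(x2) * integral of psi*f up to x1
            + phi(x1) * phi(x2))       # (sigma^2/d^2) f(x1) f(x2)

value = gamma_ls_normal(0.3, 1.1)
closed_form = Phi(0.3) - Phi(0.3) * Phi(1.1) - phi(0.3) * phi(1.1)
```

In this case the four terms collapse to $\Phi(x_1 \wedge x_2) - \Phi(x_1)\Phi(x_2) - \phi(x_1)\phi(x_2)$, the usual covariance of the normal empirical process when the mean is estimated.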
Lastly, the following technical result is required in Theorem 3.1: 


LEMMA 3.3. Assume Lemma 2.2 holds. For fixed values of i and l, fix {R,: 
j # i, 1} and [using (3.3) and (2.30)]; define 


h(r,s) =r- FIC) ~ (6 ~ 7a) 
(3.19) 


1 
alma d Wyl (r) _ G,,(s) t ALC 8), 
where ( from Lemma 2.3), A,, is uniformly bounded in (r, 8) and E8}, = o(1/n?). 
From now on, let A,, denote a generic function satisfying these properties. [ Note: 


Ed’ = o(1/n)]. 
Consider the transformation


(3.20) U= ACR,, R,), V= h(R,, R,). 


Let $f(\cdot)$ be the density of $R_i$ and assume that $l(u) = \log f(u)$ has three bounded,
continuous derivatives. Then the joint density of $(U, V)$ satisfies


fu, vlu ©) = Hao) 1 + 5 Ibu) + FRC) 


+Gu(o)l(u) + Ga(u)l(o) 

(3.21) + sr as (kalu) + kè(u)) 
1 

+ za la (ka(o) + £3(0)) 


1 
+ Sa ll? Hi (u) (0) + Apu, 0)}, 




where 

ku) = y(u)l'(u) + y(u), 

k(u) = y*(u)l"(u) + (y(u)? + 24(u)y'(u)l (u). 
Similarly, consider the transformation 

U*=A(R,, R,), 

V*= R, 


(3.22) 


(3.23) 
Then 


fue, ve (u 0) = fu)i(o){L + FURU) + Gyo) u) 
(3.24) 


+ gahol) + Ku) +449}. 


Proof. First compute the matrix of partials,
dh(u,v) dh{u,v) 


J du dv 
dh(v,u) dh(v,u) 
(3.25) du dv 
1 
Ps glu) Gi(v) 
= i ; 
Galu) 1- SIV») 


Then, using Lemma 2.3 [which shows that G,,(u) is of the form A g(u, v)] and 
the fact that || y,||? < Bp/n, it is not difficult to compute 


1 
1+ F lydy (o) + A glu, v) Ag(u, v) 
(3.26) J! = 


l : 
Ag (u, 0) 1+ <ilyIW(u) + Aalu, 0) 
Note that all second partials of h(u, v) are bounded by p/n times a bounded 


function of (u, v) with finite second moment. Since u — A(u, v) also has this 
property, the inverse function can be expanded as follows [since p?/n° = o(1/n)], 


u _,{u— h(u, v) A,(u, v) 
=(o) +4 f Like, n i heey 


Peete 

an 

~~ 
| 


One u + SHyiPy(u) + Gulo) + A,(u, v) 


o+ ZIY) + Galu) + Alu, ©) 




Similarly, 


logdet J = og(1 ee PELO) 
d 
(3.28) 


1 
+og(1 = Slloil*¥(o)} + Anfu, o). 


Now, since $R_i$ and $R_l$ are independent,
(3.29)  $\log f_{U,V}(u, v) = \log\bigl(f(r) f(s)/\det J\bigr) = l(r) + l(s) - \log\det J$.
So using (3.27) and expanding $l(r)$ in a Taylor series,


Ur) = (a) + (SHAW) + Gul) Cw) 
(3.30) 


1 
+ Sa IANCU) (u) + A, (u 0). 


Thus, expanding the log in (3.28) and inserting in (3.29), 


log f(u, v) = Hu) + F lalPRy(u) + Gulo) (u) 
1 4/2 r t 2 
+ sa lI (yuu) + (y'(u))’) 


+1(0) + Z sil%®,(0) + Gafa) Co) 


+ a Wa (4Co) o) + (W(e))’) + Aalu, v). 


Therefore, the result (3.21) follows from exponentiating; and (3.22) follows
similarly. □


REFERENCES 


BICKEL, P. and FREEDMAN, D. (1983). Bootstrapping regression models with many parameters. In
Festschrift for Erich L. Lehmann (P. Bickel, K. Doksum, and J. Hodges, Jr., eds.) 28-48.
Wadsworth, Belmont, Calif.

BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley, New York.

BURKE, M. D., CSÖRGŐ, M., CSÖRGŐ, S., and RÉVÉSZ, P. (1979). Approximation of the empirical
process when parameters are estimated. Ann. Probab. 7 790-810.

FREEDMAN, D. A. (1981). Bootstrapping regression models. Ann. Statist. 9 1218-1238.

HUBER, P. (1973). Robust regression: Asymptotics, conjectures, and Monte Carlo. Ann. Statist. 1
799-821.

HUBER, P. (1981). Robust Statistics. Wiley, New York.

KOUL, H. (1969). Asymptotic behavior of Wilcoxon type confidence procedures in multiple linear
regression. Ann. Math. Statist. 40 1950-1979.

LOYNES, R. M. (1980). The empirical distribution function of residuals from generalized regression.
Ann. Statist. 8 284-298.




PIERCE, D. A. and KOPECKY, K. J. (1979). Testing goodness of fit for the distribution of errors in
regression models. Biometrika 66 1-5.

PORTNOY, S. (1984). Asymptotic behavior of M-estimators of p regression parameters when $p^2/n$ is
large, I: Consistency. Ann. Statist. 12 1298-1309.

PORTNOY, S. (1985a). Asymptotic behavior of M-estimators of p regression parameters when $p^2/n$ is
large, II: Asymptotic normality. Ann. Statist. 13 1403-1417.

PORTNOY, S. (1985b). A central limit theorem applicable to robust regression estimators. J. Multivariate
Anal. To appear.

SHORACK, G. (1982). Bootstrapping robust regression. Comm. Statist. A—Theory Methods 11
961-972.

SHORACK, G. (1985). Empirical and rank processes of observations and residuals. Canad. J. Statist.
12 319-322.

YOHAI, V. and MARONNA, R. (1979). Asymptotic behavior of M-estimators for the linear model.
Ann. Statist. 7 258-268.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF ILLINOIS 
URBANA, ILLINOIS 61801 


The Annals of Statistics 
1986, Vol. 14, No. 3, 1171-1179


THE USE OF SUBSERIES VALUES FOR ESTIMATING
THE VARIANCE OF A GENERAL STATISTIC FROM A
STATIONARY SEQUENCE¹


By EDWARD CARLSTEIN 
University of North Carolina 


Let $\{Z_i: -\infty < i < +\infty\}$ be a strictly stationary $\alpha$-mixing sequence.
Without specifying the dependence model giving rise to $\{Z_i\}$, and without
specifying the marginal distribution of $Z_i$, we address the question of variance
estimation for a general statistic $t_n = t_n(Z_1, \ldots, Z_n)$. For estimating $\mathrm{Var}\{t_n\}$
from just the available data $(Z_1, \ldots, Z_n)$, we propose computing subseries
values: $t_m(Z_{i+1}, Z_{i+2}, \ldots, Z_{i+m})$, $0 \le i < i + m \le n$. These subseries values
are used as replicates to model the sampling variability of $t_n$. In particular,
we use adjacent nonoverlapping subseries of length $m = m_n$, with $m_n \to \infty$
and $m_n/n \to 0$. Our variance estimator is just the usual sample variance
computed amongst these subseries values (after appropriate standardization).
This estimator is shown to be consistent under mild integrability conditions.
We present optimal (i.e., minimum m.s.e.) choices of $m_n$ for the special case
where $t_n = \bar Z_n$ and $\{Z_i\}$ is a normal AR(1) sequence. A simulation study is
conducted, showing that those same choices of $m_n$ are effective when $t_n$ is a
robust estimator of location and $\{Z_i\}$ is subject to contamination.


1. Introduction. Consider this situation: a scientist is faced with data
$Z_n = (Z_1, \ldots, Z_n)$ from a stationary sequence $\{Z_i: -\infty < i < +\infty\}$. He does not
know what underlying dependence model ($M$) produced $\{Z_i\}$, nor does he know
the distribution ($F$) of the $Z_i$’s. The latter may include large-variance contamination.
A statistic $t_n = t_n(Z_n)$ is computed, e.g., a trimmed mean to estimate the
level of the sequence, or a robust estimate of scale. In order to make any
inferences from $t_n$, an estimate of the variance of $t_n$ will be needed. Our objective
is to provide a practicable and theoretically sound technique for calculating such
a variance estimate, without assuming knowledge of $M$ or $F$. This is accomplished
by using (as replicates) “subseries values” of the statistic $t$ computed on
“subseries”: $(Z_{i+1}, Z_{i+2}, \ldots, Z_{i+m})$, $0 \le i < i + m \le n$. The literature contains
no other procedure to address this question in its full generality. Furthermore,
even if specific assumptions were made about $M$ (e.g., autoregression) and $F$ (e.g.,
joint normality), actual calculation of the theoretical variance of $t_n$ in terms of
the parameters of $M$ and $F$ may be intractable. This again points to the need for
a nonparametric variance estimator for general statistics from dependent sequences.
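The subseries idea described above can be sketched in a few lines. This is an illustrative sketch only: it assumes the statistic is root-$n$ consistent, so that each subseries value is standardized by $\sqrt{m}$, and it uses the divisor $k$ (the number of blocks); both choices, the function name `subseries_variance`, and the AR(1) test data are assumptions rather than the paper's exact definition.

```python
import random
from statistics import mean

def subseries_variance(z, m, stat=mean):
    """Sample variance of sqrt(m) * stat over adjacent nonoverlapping blocks of length m.

    Targets the limit of n * Var(t_n), assuming t_n is root-n consistent (an assumption).
    """
    k = len(z) // m                      # number of complete subseries
    if k < 2:
        raise ValueError("need at least two complete subseries")
    reps = [stat(z[j * m:(j + 1) * m]) for j in range(k)]   # subseries values
    centre = mean(reps)
    return m * sum((r - centre) ** 2 for r in reps) / k

# Illustrative data: a normal AR(1) sequence with coefficient 0.5 and unit innovations.
rng = random.Random(2)
coef, n = 0.5, 20000
z, prev = [], 0.0
for _ in range(n):
    prev = coef * prev + rng.gauss(0.0, 1.0)
    z.append(prev)

sigma2_hat = subseries_variance(z, m=100)   # estimate of lim n * Var(sample mean)
var_tn_hat = sigma2_hat / n                  # implied estimate of Var(t_n)
```

For this AR(1) example the limit of $n\,\mathrm{Var}(\bar Z_n)$ is $1/(1 - 0.5)^2 = 4$, so the estimate should land in that neighborhood; passing a different `stat` (for instance a trimmed mean) reuses the same blocks without any model for the dependence.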

The setting we address is more complex than the iid case due to the presence 
of M (be it known or unknown). Therefore the practical appeal of the bootstrap 


Received November 1984; revised September 1985. 

¹Supported by NSF Grants MCS-8102725 and DMS-8400602.

AMS 1980 subject classifications. Primary 62G05; secondary 60G10.

Key words and phrases. Variance estimation, subseries, nonparametric, dependence, $\alpha$-mixing,
stationary.




1172 E. CARLSTEIN 


and jackknife estimates of variance applies a fortiori to our variance estimator: it 
“can be applied to complicated situations where parametric modeling and/or 
theoretical analysis is hopeless” (Efron, 1982). 

After presenting our basic notation and definitions in Section 2, we proceed in
Section 3 to formally define our variance estimator and to discuss it in comparison
with other variance estimators in the literature. Section 4 establishes conditions
under which our estimator is consistent in the $L_2$ sense. This consistency
result is combined with a distributional result from Carlstein (1986) to yield
asymptotic normality for general statistics from $\alpha$-mixing sequences, with the
limiting distribution being free of the nuisance parameter $\sigma^2$. In Section 5 we
determine analytically the optimal choices of $m$ (subseries length) for a useful
class of special cases. Finally, in Section 6, using the results of Section 5 as a
guide, we apply the variance estimator (via simulations) to precisely the sort of
situation described at the outset of this introduction.


2. Definitions and notation. Let $\{Z_i(\omega): -\infty < i < +\infty\}$ be a strictly
stationary sequence of real-valued random variables (r.v.) defined on probability
space $(\Omega, \mathcal{F}, P)$. Let $\mathcal{F}_p^+$ ($\mathcal{F}_p^-$, respectively) be the $\sigma$-field generated by
$\{Z_p(\omega), Z_{p+1}(\omega), \ldots\}$ ($\{\ldots, Z_{p-1}(\omega), Z_p(\omega)\}$, respectively).

For $N \ge 1$ denote: $\alpha(N) = \sup\{|P\{A \cap B\} - P\{A\}P\{B\}|: A \in \mathcal{F}_0^-, B \in \mathcal{F}_N^+\}$,
and define $\alpha$-mixing to mean $\lim_{N \to \infty} \alpha(N) = 0$.

Let $t_n(z_1, \ldots, z_n)$ be a function from $R^n \to R^1$, defined for each $n \ge 1$ so that
$t_n(Z_1(\omega), \ldots, Z_n(\omega))$ is $\mathcal{F}$-measurable. Suppressing the argument $\omega$ of $Z_i(\cdot)$ from
here on, we denote $Z_m^i = (Z_{i+1}, Z_{i+2}, \ldots, Z_{i+m})$ and $t_m^i = t_m(Z_m^i)$; as a particular
case: $Z_n^0 = (Z_1, \ldots, Z_n)$.

For $B > 0$ denote: $_B X = X \cdot I\{|X| < B\}$ and $^B X = X - {}_B X$. Expectation,
variance, and covariance will be denoted by $E$, $V$, and $C$, respectively.

Random variables $\{X_n\}$ will be said to be uniformly integrable (u.i.) iff: $\exists\, n_0$
s.t. $\lim_{A \to \infty} \sup_{n \ge n_0} E\{|^A X_n|\} = 0$. It will at times be convenient to use the
equivalent condition: $\lim_{A \to \infty} \limsup_{n \to \infty} E\{|^A X_n|\} = 0$.

3. The variance estimator. Most variance estimation techniques for general
statistics have been aimed at the special case where $\{Z_i\}$ is iid. Tukey's
"jackknife," Hartigan's "typical values," and Efron's "bootstrap" [see Efron
(1982) for descriptions] all make heavy use of exchangeability in their schemes for
generating replicates of $t$. These techniques are based on the idea that by
computing the statistic $t$ on subsamples of the data $Z_n^0$, we can gain insight about
the sampling distribution of $t_n^0$. The bootstrap, for example, resamples data from
the empirical distribution of $Z_n^0$, and then recalculates the statistic $t$ on each of
these "bootstrap" samples. These replicates of $t$ serve as an empirical approximation
to the true sampling distribution of $t_n^0$. This approximation is sensible when
$\{Z_i\}$ is iid; but when nontrivial dependence is present in $\{Z_i\}$, the true sampling
distribution of $t_n^0$ depends on the joint distribution of $Z_n^0$. Thus, the only
subsamples that will yield valid replicates of $t$ are those that preserve the
dependence structure in $\{Z_i\}$. Therefore we shall focus on subsamples of the form
$\{Z_m^j: 0 \le j \le n - m,\ n \ge m \ge 1\}$.




[Recently, Freedman (1984) has considered applying the bootstrap to a linear
model with autoregressive component; this approach assumes additive iid
perturbations. Also, as he emphasizes, the bootstrap calculations assume that the
user has correctly specified the form of the underlying autoregressive model.]

We face several competing considerations in designing a variance estimator
based on $\{t_m^j: 0 \le j \le n - m,\ n \ge m \ge 1\}$. It is clear that the performance of
such an estimator will depend upon how many representative subseries values $t_m^j$
are used, how different the $t_m^j$'s are from each other, and how accurately the $t_m^j$'s
model the behavior of $t_n^0$. For a particular value of $m$, one would not expect $t_m^j$
and $t_m^{j+1}$ to differ by much, especially in light of the dependence between $Z_m^j$
and $Z_m^{j+1}$. Hence the collection of subseries values $\{t_m^j: 0 \le j \le n - m\}$ contains
a great deal of redundancy that may not contribute information about $t_n^0$'s
sampling variability. The collection $\{t_m^{jm}: 0 \le j \le [n/m] - 1\}$, on the other
hand, contains only nonoverlapping subseries values. If $m$ grows with $n$, each $t_m^{jm}$
will eventually behave as if it were independent of all but two of the other $t_m^{jm}$'s.
Furthermore, if $m$ remained fixed, a subseries value $t_m^j$ would never be able to
reflect the dependencies of lag $m + 1$ or greater. These arguments suggest the use
of $\{t_{m_n}^{j m_n}: 0 \le j \le [n/m_n] - 1\}$, with $m_n \to \infty$ as $n \to \infty$.

Within this framework it seems reasonable to consider $m_n = [\beta n]$ ($0 < \beta < 1$),
since the corresponding $t_{m_n}^{j m_n}$'s are based on subseries of the same order of
magnitude as $t_n^0$ itself. Unfortunately, only about $1/\beta$ disjoint $t_{m_n}^{j m_n}$'s of this form
will ever be available. So an estimator based on such $t_{m_n}^{j m_n}$'s will never stabilize
and home in on $\sigma^2$, even as $n \to \infty$. (Ironically, the bootstrap and typical-value
methods use randomly selected subsets of the possible subsamples, since it is
computationally impractical to use all the subsamples available.)

In light of these factors we propose the use of the subseries values
$\{t_{m_n}^{j m_n}: 0 \le j \le k_n - 1\}$, where $\{m_n: n \ge 1\}$ are positive integers s.t. $m_n \to \infty$ and
$m_n/n \to 0$ as $n \to \infty$, and $k_n = [n/m_n]$. Thus we obtain an increasing number
($k_n$) of subseries values, each of which is based on an ever-growing subseries
($Z_{m_n}^{j m_n}$); and each $t_{m_n}^{j m_n}$ is becoming increasingly distant ($m_n$) from all but two of
the other $t_{m_n}^{j m_n}$'s.

From this point on we will assume the following set-up: $s_n^i := s_n(Z_n^i)$ is a
statistic that is wholly computable from the data $Z_n^i$, and does not involve any
unknown parameters. $t_n^i := (s_n^i - E\{s_n^0\})n^{1/2}$ is the correct theoretical
standardization for $s_n^i$, in the sense that $\lim_{n\to\infty} E\{(t_n^0)^2\} =: \sigma^2 \in (0, \infty)$. The proposed
estimator for $\sigma^2$ is simply

    $\hat\sigma_n^2 := m_n \sum_{i=0}^{k_n-1} (s_{m_n}^{i m_n} - \bar s_n)^2 / k_n$,  where  $\bar s_n := \sum_{i=0}^{k_n-1} s_{m_n}^{i m_n} / k_n$.

This is nothing more than the usual sample variance amongst the standardized
subseries values $\{m_n^{1/2} s_{m_n}^{j m_n}: 0 \le j \le k_n - 1\}$.
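The estimator just defined can be sketched in a few lines of Python. This is a minimal illustration rather than the author's code: the function name `subseries_variance`, its argument names, and the use of NumPy are assumptions made here, and the subseries length `m` is simply supplied by the caller.

```python
import numpy as np


def subseries_variance(z, m, stat=np.mean):
    """Subseries estimate of sigma^2 = lim n * Var(stat(Z_1, ..., Z_n)).

    z    : 1-D array holding the observed series Z_1, ..., Z_n
    m    : subseries length m_n, with 1 <= m <= n (chosen by the caller)
    stat : function mapping a 1-D array to a scalar (the statistic s)
    """
    z = np.asarray(z, dtype=float)
    n = z.size
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= n")
    k = n // m  # k_n = [n / m_n] nonoverlapping subseries
    # Statistic recomputed on each adjacent, nonoverlapping block of length m.
    values = np.array([stat(z[i * m:(i + 1) * m]) for i in range(k)])
    # m times the (divide-by-k) sample variance of the subseries values.
    return float(m * np.mean((values - values.mean()) ** 2))
```

Any scalar statistic can be passed as `stat` (for example `np.median`); observations left over after the last complete block are ignored, matching $k_n = [n/m_n]$.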


4. $L_2$-consistency. In this section we work out some theory for subseries
values. The first main result is a law of large numbers for these entities. This
result is used to obtain consistency of $\hat\sigma_n^2$. Finally, we arrive at an asymptotic
normality result for $t_n^0$ in which the limiting distribution is free of $\sigma^2$.




Let us begin with a useful truncation lemma: 


LEMMA 1. Let $X$ be $\mathcal{F}_p^-$-measurable and $Y$ be $\mathcal{F}_q^+$-measurable, $q > p$.
Suppose $\max\{E\{X^2\}, E\{Y^2\}\} \le C < \infty$. Then for any $A > 0$:

    $|C\{X, Y\}| \le 4A^2\alpha(q - p) + 3C^{1/2}\bigl((E\{({}^A X)^2\})^{1/2} + (E\{({}^A Y)^2\})^{1/2}\bigr)$.


PROOF. Writing $X = {}_A X + {}^A X$ we see that

    $|C\{X, Y\}| \le |C\{{}_A X, {}_A Y\}| + |E\{{}_A X \cdot {}^A Y\}| + |E\{{}^A X \cdot {}_A Y\}| + |E\{{}^A X \cdot {}^A Y\}|$
               $+ |E\{{}_A X\}E\{{}^A Y\}| + |E\{{}^A X\}E\{{}_A Y\}| + |E\{{}^A X\}E\{{}^A Y\}|$.

The first term on the right-hand side is bounded above by $4A^2\alpha(q - p)$
[Theorem 17.2.1, Ibragimov and Linnik (1971)]. The required bounds on the other
terms follow from the Schwarz inequality. □


Applying this lemma we can establish the following law of large numbers for
subseries values from an $\alpha$-mixing sequence.


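THEOREM 2. Let $\{Z_i\}$ be $\alpha$-mixing and let $f_n(Z_n^i) =: f_n^i$ be a statistic. Let
$\{m_n: n \ge 1\}$ be s.t. $m_n \to \infty$ and $m_n/n \to 0$; let $k_n = [n/m_n]$. Define
$\bar f_n := \sum_{i=0}^{k_n-1} f_{m_n}^{i m_n} / k_n$. If

(2a)    $\lim_{n\to\infty} E\{f_n^0\} = \phi \in R^1$,

and

(2b)    $(f_n^0)^2$ are u.i.,

then

(2c)    $\bar f_n \to_{L_2} \phi$ as $n \to \infty$.


PROOF. By (2a) it suffices to show $\lim_{n\to\infty} V\{\bar f_n\} = 0$. Now

    $\tfrac12 V\{\bar f_n\} \le k_n^{-2} \sum_{0 \le i \le j \le k_n-1} |C\{f_{m_n}^{i m_n}, f_{m_n}^{j m_n}\}|$
                     $\le k_n^{-1}\bigl[2E\{(f_{m_n}^0)^2\} + \sum_{2 \le j \le k_n-1} |C\{f_{m_n}^0, f_{m_n}^{j m_n}\}|\bigr]$.

The idea here is that the covariance between nonadjacent $f_{m_n}^{j m_n}$'s is dropping off
as the separation ($m_n$) increases. So, although there are order $k_n$ of these terms,
their average becomes negligible as $n \to \infty$.
Formally, we note first that [by (2b)] $E\{(f_n^0)^2\}$ are bounded uniformly in
$n \ge n_0$ by $C < \infty$. Assume now that $n$ is sufficiently large so that $m_n \ge n_0$.
Then for each $j \in \{2, 3, \ldots, k_n - 1\}$ we have

    $|C\{f_{m_n}^0, f_{m_n}^{j m_n}\}| \le 4A^2\alpha(m_n) + 6C^{1/2}\bigl(\sup_{m \ge n_0} E\{({}^A f_m^0)^2\}\bigr)^{1/2} =: B(n, A)$ for any $A > 0$

by Lemma 1. Hence: $\tfrac12 V\{\bar f_n\} \le 2C/k_n + B(n, A)$ for any $A > 0$. Now take
$\lim_{A\to\infty} \limsup_{n\to\infty}(\cdot)$ of this last expression. □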

We are ready to prove $L_2$-consistency of $\hat\sigma_n^2$. This result follows in part from
Theorem 2, since $\hat\sigma_n^2$ is essentially a mean.




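THEOREM 3. Let $\{Z_i\}$ be $\alpha$-mixing and let $\{m_n\}$ and $k_n$ be as in Theorem 2.
Let $s_n^i$, $t_n^i$, $\sigma^2$, $\hat\sigma_n^2$ be as defined in Section 3. If

(3a)    $(t_n^0)^4$ are u.i.,

then

(3b)    $\hat\sigma_n^2 \to_{L_2} \sigma^2$ as $n \to \infty$.


PROOF. Write $\hat\sigma_n^2 = \overline{t^2}_n - (\bar t_n)^2$, where $\bar t_n := \sum_{i=0}^{k_n-1} t_{m_n}^{i m_n}/k_n$ and
$\overline{t^2}_n := \sum_{i=0}^{k_n-1} (t_{m_n}^{i m_n})^2/k_n$. Clearly we only need to show $\overline{t^2}_n \to_{L_2} \sigma^2$ and
$(\bar t_n)^2 \to_{L_2} 0$. The former follows from Theorem 2.

In order to show $(\bar t_n)^2 \to_{L_2} 0$, note first that $\bar t_n \to_P 0$ by Theorem 2. Therefore,
by the mean convergence criterion [see Chow and Teicher (1978), page 98], it
suffices to establish that $(\bar t_n)^4$ are u.i. Now $(\bar t_n)^2 \le \overline{t^2}_n$, so that for $A > 0$:
$E\{(\bar t_n)^4 I\{(\bar t_n)^4 \ge A\}\} \le E\{(\overline{t^2}_n)^2 I\{(\overline{t^2}_n)^2 \ge A\}\}$. Hence we only need to show u.i.
of $(\overline{t^2}_n)^2$. But by (3a) we know that $E\{(\overline{t^2}_n)^2\} < \infty$ when $m_n \ge n_0$; and
$\overline{t^2}_n \to_{L_2} \sigma^2$ as mentioned above. Thus the mean convergence criterion (converse)
yields the required result. □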


Notice that both Theorems 2 and 3 are logically independent of the question
of convergence in distribution. These results give integrability conditions that
guarantee $L_2$-consistency of estimators based on the subseries values from an
$\alpha$-mixing sequence, regardless of whether the $t_n^0$'s (or $f_n^0$'s) are converging in
distribution. Furthermore, we have not constrained the mixing coefficient $\alpha$ or
the subseries length $m_n$ in any way other than $\alpha(n) \to 0$, $m_n \to \infty$, $m_n/n \to 0$.
In practice the $L_2$-consistency is desirable because it translates into shrinking
variance and bias.

We can now combine the variance estimation result (Theorem 3) with the 
distributional results of Carlstein (1986), and obtain: 


THEOREM 4. Let $\{Z_i\}$ be $\alpha$-mixing and let $\{m_n\}$, $s_n^i$, $t_n^i$, $\hat\sigma_n^2$ be as defined in
Theorem 3. If

(4a)    $\lim\, (r_n/v_n)^{1/2}\, C\{t_{r_n}^0, t_{v_n}^{u_n}\} = \sigma^2$  as  $r_n \ge u_n + v_n \ge v_n \to \infty$,

(4b)    $\limsup_{n\to\infty} E\{(t_n^0)^4\} = 3\sigma^4$,

then (3b) holds, and also

(4c)    $(t_{r_n}^0, t_{v_n}^{u_n})/\hat\sigma_n \to_D N_2(0, 0, 1, 1, \rho)$  as  $v_n/r_n \to \rho^2$, $r_n \ge u_n + v_n \ge v_n \to \infty$,  $\forall \rho^2 \in [0, 1]$.

[The generalized limit notation is the same as that defined in Carlstein (1986).]


PROOF. We will begin by showing that $(t_{r_n}^0, t_{v_n}^{u_n})/\sigma \to_D N_2(0, 0, 1, 1, \rho)$, via
Theorem 4 of Carlstein (1986). Since $E\{t_n^0\} = 0$, it suffices to observe that (4b)
implies that $(t_n^0)^2$ are u.i.

Next we want to use Theorem 3 to conclude that (3b) holds. In light of (4a)
with $u_n = 0$ and $r_n = v_n = n$, it is enough to verify (3a). But (3a) follows directly
from (4b) together with $t_n^0 \to_D N(0, \sigma^2)$ (established above). □




[Condition (4b) may of course be replaced by the less specific condition (3a).] 
The sample mean and sample fractile statistics are discussed as theoretical 
examples in Corollaries 8 and 10 (respectively) of Carlstein (1986). 


5. Optimal subseries length. The results of Section 4 gave an asymptotic
justification for the use of $\hat\sigma_n^2$. In fact, the asymptotics held for an extremely large
class of sequences $\{m_n\}$ of subseries lengths. In practice, however, the performance
of $\hat\sigma_n^2$ (for fixed $n$) will be greatly influenced by the particular choice of
$m_n$. Our intuition tells us that by increasing $m_n$ we should reduce the bias of $\hat\sigma_n^2$,
since our replicates $(t_{m_n}^{i m_n})^2$ will more closely resemble the large-$N$ $(t_N^0)^2$ whose
expectation is being estimated. Furthermore, as the dependence in $\{Z_i\}$ becomes
stronger, we will need longer subseries in order for $t_{m_n}^{i m_n}$ to adequately model the
dependence present in $t_n^0$. On the other hand, by decreasing $m_n$ (i.e., increasing
$k_n$) we expect to reduce the variance of $\hat\sigma_n^2$, since more replicates become
available. This interaction of bias, variance, and dependence will yield an optimal
(i.e., minimum m.s.e.) $m_n$, for a given statistic $t$ and a fixed $n$. Unfortunately, but
not surprisingly, we are unable to make statements regarding optimal subseries
length that apply with the generality of the consistency results in Section 4. We
can, however, make very precise statements in the following special case.

Let $\{Z_i\}$ be an AR(1) sequence: $Z_i = \phi Z_{i-1} + \varepsilon_i$, where $|\phi| < 1$ and $\{\varepsilon_i\}$ are iid
$N(0, 1)$. The statistic $s_n^0 = \bar Z_n^0$ has asymptotic variance $\sigma^2 = (1 - \phi)^{-2}$, which is
to be estimated by $\tilde\sigma_n^2 := m_n \sum_{i=0}^{k_n-1} (s_{m_n}^{i m_n})^2/k_n$ ($E\{s_n^0\} = 0$). In this situation, we
can explicitly calculate the effect of subseries length on bias and variance:


    $\sigma^2 - E\{\tilde\sigma_n^2\} = 2\phi e/(a^2 c\, m_n) = 2\phi/(a^2 c\, m_n) + o(1/m_n)$,
and
V(a2} = 2{ b?/a + 2m; | 3(¢? - 29 — 4)/b 
+ (b — 7d + 11¢d + 3¢(5¢? — 4¢ - 5)/b)/a? 
+ 3(3d + (g + 2g? + 3)/b)/a + m;,1(— 3¢f/be 
+ 6[(¢? + 24? + 84 — 2db + g*b)/c 
— (5 — 11g + 4¢? + 84° + 5g — d)/a 
+ (g? + 294 — 34? — f )/be + 2( 4% — g?)/c? 
+ (2d + 2g — 244 + ¢3 — 36)/a? + 3g + 5a] 
— ġe?/a? + 6p [3(1 + 26 — 4?) /b + 2(d? — 4°) /be 
+ (36? — 49? + 26 — 5d + 4d¢)/a] /a? 
+ [1 ~ (1 - d***)/k, f | 
x [1 - dt + 2d? — 2d + 2e?(¢ - d)/a] /fa®b)|\ /k,c%a 
    $= 2/(a^4 k_n) + o(1/k_n)$;
where $a = 1 - \phi$, $b = 1 + \phi$, $c = ab$, $d = \phi^{m_n}$, $e = 1 - d$, $f = 1 - d^2$, $g = d/\phi$.
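The bias expression can be checked numerically. Because $E\{s_n^0\} = 0$, the expectation of the uncentred estimator equals $m_n V\{\bar Z_{m_n}^0\}$, which follows from the AR(1) autocovariances $\phi^{|h|}/(1 - \phi^2)$. The sketch below is an added illustration (the function names are hypothetical) comparing that exact value with the closed form $2\phi e/(a^2 c\, m_n)$.

```python
def ar1_scaled_var_of_mean(phi, m):
    """m * Var(mean of m consecutive AR(1) values), unit-variance innovations."""
    gamma0 = 1.0 / (1.0 - phi ** 2)  # Var(Z_i)
    # Var(sum) = gamma0 * [m + 2 * sum_{h=1}^{m-1} (m - h) * phi^h]
    total = m + 2.0 * sum((m - h) * phi ** h for h in range(1, m))
    return gamma0 * total / m


def bias_closed_form(phi, m):
    """sigma^2 - E{estimator} = 2*phi*e / (a^2 * c * m), with a, c, e as in the text."""
    a = 1.0 - phi
    c = 1.0 - phi ** 2
    e = 1.0 - phi ** m
    return 2.0 * phi * e / (a ** 2 * c * m)


def bias_from_autocovariances(phi, m):
    """The same bias computed directly as sigma^2 - m * Var(mean)."""
    sigma2 = 1.0 / (1.0 - phi) ** 2
    return sigma2 - ar1_scaled_var_of_mean(phi, m)
```

For any $|\phi| < 1$ and $m \ge 1$ the two bias functions should agree up to floating-point rounding.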


FIG. 1. Variance and bias of $\tilde\sigma_n^2$ with $\phi = 0.9$, $n = 100$. (Bias)$^2$: solid line; Variance: dashed line.


Figure 1 illustrates the influence of $m_n$ and $k_n$ on the bias and variance in the
case $n = 100$, $\phi = 0.9$. The jumps in $V\{\tilde\sigma_n^2\}$ are due to abrupt changes in $k_n$.
Notice that $V\{\tilde\sigma_n^2\}$ increases with $m_n$ even while the number of replicates remains
constant.

Using just the first-order contributions from the bias and variance, we
approximate

    m.s.e.$\{\tilde\sigma_n^2\} \approx (4\phi^2/a^4c^2)\,m_n^{-2} + (2/a^4)\,m_n/n$.

Hence the optimal subseries length is approximately

    $m_n^* = (2\phi/c)^{2/3}\, n^{1/3}$,

with corresponding m.s.e. $3(2\phi/c)^{2/3}/(a^4 n^{2/3})$. Observe that longer subseries are
required as the dependence becomes stronger.
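As a hedged numerical companion to these formulas, the sketch below evaluates the first-order m.s.e. approximation and the resulting $m_n^*$. The function names are assumptions introduced here, and $(2\phi/c)^{2/3}$ is written as $(4\phi^2/c^2)^{1/3}$ so that negative $\phi$ does not produce a complex power.

```python
def approx_mse(phi, n, m):
    """First-order m.s.e.: (4 phi^2 / (a^4 c^2)) m^-2 + (2 / a^4) m / n."""
    a = 1.0 - phi
    c = 1.0 - phi ** 2
    return 4.0 * phi ** 2 / (a ** 4 * c ** 2 * m ** 2) + 2.0 * m / (a ** 4 * n)


def optimal_subseries_length(phi, n):
    """m* = (2 phi / c)^(2/3) n^(1/3), returned as a real number (not rounded)."""
    c = 1.0 - phi ** 2
    return (4.0 * phi ** 2 / c ** 2) ** (1.0 / 3.0) * n ** (1.0 / 3.0)


def optimal_mse(phi, n):
    """m.s.e. attained at m*: 3 (2 phi / c)^(2/3) / (a^4 n^(2/3))."""
    a = 1.0 - phi
    c = 1.0 - phi ** 2
    return 3.0 * (4.0 * phi ** 2 / c ** 2) ** (1.0 / 3.0) / (a ** 4 * n ** (2.0 / 3.0))
```

With $\phi = 0.9$ and $n = 100$ this gives a real-valued $m_n^*$ of roughly 21; in practice the value would still have to be rounded to an integer between 1 and $n$.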


6. Application. Let $\{Z_i\}$ be an AR(1) sequence: $Z_i = \phi Z_{i-1} + \varepsilon_i$, where
$|\phi| < 1$ and $\varepsilon_i$ are iid errors from the contaminated distribution $(1 - \eta)F_1(\cdot) +
\eta F_\tau(\cdot)$ [$F_c(\cdot)$ denotes the c.d.f. of a $N(0, c^2)$ r.v.]. A scientist observes $Z_n^0$, and,
suspecting contamination, he computes a $\delta$% trimmed-mean (our $s_n^0$) to estimate
$E\{Z_i\}$. In order to estimate the variance of $s_n^0$, he will apply $\hat\sigma_n^2$. [Gastwirth and
Rubin (1975) give an expression for the asymptotic variance of $s_n^0$ in terms of an
infinite sum of Hermite polynomials, assuming, however, that $\{Z_i\}$ is a normal
sequence.]

We propose using the results of Section 5 simply as a guide in selecting an
appropriate $m_n$: the scientist can calculate

    $\hat\phi = n \sum_{i=1}^{n-1} (Z_{i+1} - \bar Z_n^0)(Z_i - \bar Z_n^0) \Big/ (n - 1) \sum_{i=1}^{n} (Z_i - \bar Z_n^0)^2$

as a preliminary measure of the strength of dependence in $\{Z_i\}$. Based on this $\hat\phi$,
he can now estimate $m_n^*$. Although the resulting choices of $m_n$ are not in general
going to be optimal, this is a realistic strategy, given the amount of information
available to the practitioner.
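A minimal sketch of this plug-in strategy follows. The lag-one statistic matches the displayed $\hat\phi$; the rounding rule, the clipping of the result to $[1, n]$, and the fallback when $|\hat\phi| \ge 1$ are assumptions added here, since the text does not say how $m_n^*$ is turned into an integer.

```python
import numpy as np


def lag1_autocorrelation(z):
    """phi-hat = n * sum (Z_{i+1} - Zbar)(Z_i - Zbar) / ((n - 1) * sum (Z_i - Zbar)^2)."""
    z = np.asarray(z, dtype=float)
    n = z.size
    d = z - z.mean()
    return float(n * np.sum(d[1:] * d[:-1]) / ((n - 1) * np.sum(d ** 2)))


def plug_in_subseries_length(z):
    """Estimate m* by substituting phi-hat into m* = (2 phi / c)^(2/3) n^(1/3)."""
    z = np.asarray(z, dtype=float)
    n = z.size
    phi_hat = lag1_autocorrelation(z)
    if abs(phi_hat) >= 1.0:
        # Assumption: fall back to a single subseries when the formula breaks down.
        return n
    c_hat = 1.0 - phi_hat ** 2
    m_star = (4.0 * phi_hat ** 2 / c_hat ** 2) ** (1.0 / 3.0) * n ** (1.0 / 3.0)
    # Assumption: round to the nearest integer and keep the result within [1, n].
    return int(min(n, max(1, round(m_star))))
```

The returned length can then be passed, together with a trimmed-mean function, to whatever routine computes the subseries variance estimate.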

The entire procedure described above was carried out on 200 realizations of $Z_n^0$,
with: $\eta = 0.3$, $\tau^2 = 10$, $\delta = 40$% (20% in each tail), $\phi = 0.2$ and 0.8, $n = 100$ and
1000. Table 1 shows that this procedure yields reasonable results. A balance
between variance and bias is maintained, and m.s.e.-consistency is exhibited.
Moreover, the quality of the performance of $\hat\sigma_n^2$ is not affected by the strength of
dependence in $\{Z_i\}$.


TABLE 1

Simulation study of $\hat\sigma_n^2$ as an estimator of the variance ($\sigma^2$)
of a 40% trimmed mean ($s_n^0$).*

  $\phi$    $\sigma^2$†    $n$     $E\{\hat\sigma_n^2\}$††   $V\{\hat\sigma_n^2\}$   $V\{\hat\sigma_n^2\}$/m.s.e.$\{\hat\sigma_n^2\}$   m.s.e.$\{\hat\sigma_n^2\}/\sigma^4$

  0.2    3.3     100    4.5 (0.10)     1.95       0.57                  0.31
                 1000   4.0 (0.03)     0.16       0.25                  0.06
  0.8    88      100    50 (2.6)       1383       0.49                  0.36
                 1000   71 (1.1)       232        0.45                  0.07

* The data are from an AR(1) sequence with 30% contamination. Subseries lengths ($m_n$) are
based on $\hat m_n^*$.
† Each $\sigma^2 = \lim_{N\to\infty} V\{N^{1/2}s_N^0\}$ was estimated empirically by 200 realizations of $N^{1/2}s_N^0$
with $N = 200$.
†† Each row was estimated empirically by 200 realizations of $\hat\sigma_n^2$. An estimate of the standard
deviation of $\hat E\{\hat\sigma_n^2\}$ appears in parentheses.




Acknowledgments. I thank Professor John Hartigan, my thesis advisor at 
Yale University, for his guidance on this research. The suggestions of the referees 
and the Associate Editor were extremely constructive; in particular their com- 
ments led to the material of Sections 5 and 6. 


REFERENCES 


CARLSTEIN, E. (1986). Asymptotic normality for a general statistic from a stationary sequence.
Ann. Probab. 14 1371-1379.

CHOW, Y. S. and TEICHER, H. (1978). Probability Theory. Springer, New York.

EFRON, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.

FREEDMAN, D. (1984). On bootstrapping two-stage least-squares estimates in stationary linear
models. Ann. Statist. 12 827-842.

GASTWIRTH, J. L. and RUBIN, H. (1975). The behavior of robust estimators on dependent data. Ann.
Statist. 3 1070-1100.

IBRAGIMOV, I. A. and LINNIK, YU. V. (1971). Independent and Stationary Sequences of Random
Variables. Wolters-Noordhoff, The Netherlands.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF NORTH CAROLINA 
CHAPEL HILL, NORTH CAROLINA 27514 


The Annals of Statistics
1986, Vol. 14, No. 3, 1180-1193


A MINIMUM DISTANCE ESTIMATOR FOR FIRST-ORDER 
AUTOREGRESSIVE PROCESSES 


By CHAMONT W. H. WANG¹
University of Montana 


In this paper we construct a class of minimum distance Cramér-von
Mises-type estimators for the parameter in the first-order stationary
autoregressive time series. The estimator is proved to be asymptotically normal
under appropriate assumptions. The proofs involve some results of independent
interest.


1. Introduction. Consider the first-order stationary autoregressive model

(1.1)    $X_k = \beta X_{k-1} + U_k$,  $|\beta| < 1$,  $-\infty < k < \infty$,

where $\{U_k\}$ are independent and identically distributed random shocks with
$E(U) = 0$ and $0 < \sigma^2 = \mathrm{Var}(U) < \infty$. This model has been widely used in
applications for forecasting and control [see Box and Jenkins (1976)]. The
constant $\beta$ in (1.1) is the unknown parameter which we would like to estimate.
Based on the observations $\{X_0, X_1, \ldots, X_n\}$ the commonly used least-squares
estimate of $\beta$ is given by

(1.2)    $\hat\beta_{LS} = \sum_{k=1}^{n} X_{k-1}X_k \Big/ \sum_{k=1}^{n} X_{k-1}^2$.
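For concreteness, (1.2) can be computed as follows. This is only an illustrative sketch; the function name and the use of NumPy are assumptions made here.

```python
import numpy as np


def least_squares_ar1(x):
    """Least-squares estimate (1.2): sum X_{k-1} X_k / sum X_{k-1}^2, k = 1..n.

    x holds the observations X_0, X_1, ..., X_n in time order.
    """
    x = np.asarray(x, dtype=float)
    prev, curr = x[:-1], x[1:]  # X_{k-1} and X_k for k = 1, ..., n
    return float(np.sum(prev * curr) / np.sum(prev ** 2))
```

For the noise-free sequence 1, 2, 4, 8 the function returns 2.0, the ratio between consecutive terms; that sequence is only an arithmetic check and does not satisfy $|\beta| < 1$.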


It can be proved [see, e.g., Akritas and Johnson (1980)] that, within a wide class
of density functions $f$ of $U$, the best attainable asymptotic variance for the
estimates of $\beta$ under model (1.1) is

(1.3)    $(1 - \beta^2)/(\sigma^2 I(f))$,

where $I(f)$ is the Fisher information of $f$. Note that the asymptotic variance of
the least-squares estimate is $1 - \beta^2$ [Theorem 4.3, Anderson (1959)], which
equals (1.3) for Gaussian disturbances but not for more general situations [see,
e.g., Hájek and Šidák (1967), pages 16 and 17].

Akritas and Johnson proposed an alternative test statistic to compete with the
normal theory least squares tests under the contiguity framework. Lai and
Siegmund (1983) showed that the asymptotic normality of $\hat\beta_{LS}$ is not especially
good when $n = 50$, even for small $|\beta|$ and normal shocks, and it deteriorates quite
noticeably for $\beta$ near 1. Lai and Siegmund proposed an estimator under a
sequential sampling scheme to improve the fixed sample size least-squares
estimator.

In this paper we consider a minimum distance Cramér-von Mises-type estimator
of $\beta$ under model (1.1). This approach has been successful in the usual


Received March 1984; revised December 1985. 

1 Now at Trenton State College. 

AMS 1980 subject classification. Primary 62L10. 

Key words and phrases. Randomly weighted empirical processes, stationary and ergodic processes,
bounded functionals on $L^2$-spaces, Lipschitz condition, conditional expectation.




regression model. See, for example, Williamson (1982) and Millar (1982). Millar
considered a general regression model and proved that his estimator of the
regression function is asymptotically normal, efficient, and qualitatively robust
against certain contaminated errors. Veitch (1983) considered an estimator $\hat\theta$
which minimizes the distance (in a certain Hilbert space) between the empirical
distribution $\hat F$ and the true distribution $F_\theta$ of a general stationary time series
$\{X_0, X_1, \ldots, X_n\}$. The estimator is proved to be asymptotically normal and
qualitatively robust. Veitch's result includes all Gaussian ARMA models with a
finite number of parameters.

Instead of assuming the knowledge of the error distribution $F$, we propose a
nonparametric estimator for the first-order autoregressive processes. The
motivation of this estimator is as follows. Consider the randomly weighted empirical
process

(1.4)    $S(t, \Delta) = \sum_{k=1}^{n} X_{k-1} I(X_k \le t + \Delta X_{k-1})$,

where $I(A)$ is the indicator function of set $A$. Note that, by the completeness of
$L^2$-spaces, (1.1) can be rewritten as [see, e.g., Fuller (1976), page 29]

(1.5)    $X_k = \sum_{j=0}^{\infty} \beta^j U_{k-j} = \beta^k X_0 + \sum_{j=0}^{k-1} \beta^j U_{k-j}$.


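Hence $X_{k-1}$ and $U_k$ are independent, $E(X_k) = 0$ for all $k$, and $ES(t, \beta) = 0$.
Furthermore, by denoting

(1.6)    $F(t) = P(U_k \le t)$,  $G(t) = P(X_k \le t)$

for all $k$, we have

(1.7)    $ES(t, \Delta) = n \int xF(t + (\Delta - \beta)x)\,dG(x)$.

We now claim

(1.8)    $ES(t, \Delta) \ge 0$ if $\Delta \ge \beta$  and  $ES(t, \Delta) \le 0$ if $\Delta \le \beta$.

To prove this, consider the case $\Delta \ge \beta$ first. Note that

    $x \ge 0$ implies $F(t + (\Delta - \beta)x) \ge F(t)$ and $xF(t + (\Delta - \beta)x) \ge xF(t)$,
    $x < 0$ implies $F(t + (\Delta - \beta)x) \le F(t)$ and $xF(t + (\Delta - \beta)x) \ge xF(t)$.

Integrating $xF(t + (\Delta - \beta)x) \ge xF(t)$ with respect to $G$ and using $E(X_k) = 0$ gives
$ES(t, \Delta) \ge nF(t)\int x\,dG(x) = 0$. This completes the proof of the first half of (1.8).
The second part is done by the same argument. Similarly, we can prove that
$S(t, \Delta)$ is nondecreasing in $\Delta$ for any fixed $t$. Hence, if we define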

(1.9)    $Q(\Delta) = \int_{-\infty}^{\infty} S^2(t, \Delta)\,dH(t)$,

where $H$ is a finite measure on $(R, \mathcal{B})$, then similar to a rank test for testing $H_0$:
$\beta = \beta_0$ [Hájek and Šidák (1967), page 103], $H_0$ is rejected for large values of
$Q(\beta_0)$. Therefore it is reasonable to define an estimator $\hat\beta$ for $\beta$ by the relation

(1.10)    $Q(\hat\beta) = \inf_{\Delta} Q(\Delta)$.

In Section 2 we obtain the asymptotic quadratic approximation of $Q(\Delta)$ and
then use this result to prove the asymptotic normality of $\hat\beta$. We also discuss a
method to find $H$ to minimize $\mathrm{Var}(\hat\beta)$.
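The construction (1.4), (1.9), (1.10) can be sketched numerically as below. This is an illustration under stated assumptions, not the paper's procedure: $H$ is taken to be a discrete finite measure with masses `weights` on the points `t_grid`, the infimum is approximated by a search over a user-supplied `delta_grid`, and all function names are hypothetical.

```python
import numpy as np


def weighted_empirical_process(x, t, delta):
    """S(t, delta) = sum_k X_{k-1} * 1{X_k <= t + delta * X_{k-1}}, one value per t."""
    x = np.asarray(x, dtype=float)
    prev, curr = x[:-1], x[1:]
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Row i holds the indicators for t[i]; columns run over k = 1, ..., n.
    ind = curr[None, :] <= t[:, None] + delta * prev[None, :]
    return (ind * prev[None, :]).sum(axis=1)


def q_criterion(x, delta, t_grid, weights):
    """Q(delta) = integral of S^2(t, delta) dH(t) for a discrete H on t_grid."""
    s = weighted_empirical_process(x, t_grid, delta)
    return float(np.sum(np.asarray(weights, dtype=float) * s ** 2))


def minimum_distance_estimate(x, delta_grid, t_grid, weights):
    """Grid-search approximation to the minimizer of Q in (1.10)."""
    q_values = [q_criterion(x, d, t_grid, weights) for d in delta_grid]
    return float(delta_grid[int(np.argmin(q_values))])
```

Because $S(t, \Delta)$ is a step function of $\Delta$, $Q$ is piecewise constant and its minimizer need not be unique; the grid search returns the first grid point attaining the smallest value.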

Another nonparametric minimum distance estimator was proposed by Koul
(1985). First note that by using an extra assumption that the error distribution $F$
is symmetric around 0, $F(t)$ can be estimated either by $n^{-1}\sum I[X_k \le t + \beta X_{k-1}]$
or by $n^{-1}\sum I[-X_k < t - \beta X_{k-1}]$. Koul then considered an estimator of $\beta$ as a $\Delta$
which minimizes

(1.11)    $\int \bigl[\sum X_{k-1}\{I(X_k \le t + \Delta X_{k-1}) - I(-X_k < t - \Delta X_{k-1})\}\bigr]^2\,dH(t)$,

where $H$ can be a $\sigma$-finite measure. For the asymptotic behavior of this estimator,
see Koul (1985).


2. Main results. The following Proposition 1 concerns the asymptotic
quadraticity of $Q(\Delta)$. The result will be used to study the asymptotic distribution
of $\hat\beta$.


PROPOSITION 1. If $I(f) < \infty$, $H$ is a finite measure and $E|U|^6 < \infty$, then

(2.1)    $E\bigl\{\sup_{\sqrt{n}|\Delta - \beta| \le a} n^{-1}|Q(\Delta) - Q_1(\Delta)|\bigr\} \to 0$ as $n \to \infty$,

where $a$ is a fixed number and

(2.2)    $Q_1(\Delta) = \int \{S(t, \beta) + n(\Delta - \beta)\sigma_X^2 f(t)\}^2\,dH(t)$,

(2.3)    $\sigma_X^2 = \mathrm{Var}(X) = \sigma^2/(1 - \beta^2)$.


PROOF. See Section 3.


REMARKS ON THE ASSUMPTIONS. $I(f) < \infty$ implies

(2.4)    $f(x) \to 0$ as $x \to \pm\infty$

[Hájek and Šidák (1967), page 20]. This in turn implies

(2.5)    $f$ is bounded,

since $f$ is continuous, and

(2.6)    $F$ satisfies a Lipschitz condition

[Royden (1968), page 108] and

(2.7)    $\eta(z) = \sup_y (f(y) - f(y - z))^2$ is continuous at 0,

where (2.7) holds by using (2.4) for large $y$ and by using the fact that $f$ is
continuous and hence uniformly continuous on compact sets. (2.4)-(2.7) are
crucial in the proof of (2.1). Indeed, we can replace $I(f) < \infty$ by the weaker
assumptions (2.4) and (2.5) for the rest of the proof. We state these as assumptions:

(A.1)    $f(x) \to 0$ as $x \to \pm\infty$,

(A.2)    $f$ is continuous.


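Note that $Q_1(\Delta)$, as defined in (2.2), is a quadratic function and has its unique
minimum at $\hat\Delta$, where $\hat\Delta$ is defined by

(2.8)    $\sqrt{n}(\hat\Delta - \beta) = \bigl(-1/\sigma_X^2\sqrt{n}\int f^2\,dH\bigr) \int S(t, \beta) f(t)\,dH(t)$
                   $= \bigl(-1/\sigma_X^2\sqrt{n}\int f^2\,dH\bigr) \sum_{k=1}^{n} X_{k-1} \int I(U_k \le t) f(t)\,dH(t)$.

Thus we conclude:

PROPOSITION 2. Under the assumptions of Proposition 1,

(2.9)    $\sqrt{n}(\hat\beta - \beta) = \bigl(-1/\sigma_X^2\sqrt{n}\int f^2\,dH\bigr) \sum_{k=1}^{n} X_{k-1} \int I(U_k \le t) f(t)\,dH(t) + o_p(1)$.

PROOF. See Section 3.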

THEOREM 1. Assume that (A.1) and (A.2) hold, that $E|U|^6 < \infty$, and that
$H$ is a finite measure. Then $\sqrt{n}(\hat\beta - \beta)$ is asymptotically normal with mean 0
and asymptotic variance $\sigma^{-2}(\int f^2\,dH)^{-2}\{(1 - \beta^2)E(\tilde U^2) + 2\beta(1 + \beta)E^2(\tilde U)\}$,
where $\tilde U = \int I(U \le t) f(t)\,dH(t)$.


Theorem 1 is a by-product of the following more general Lemma 1 and 
Theorem 2. 


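LEMMA 1. For the model (1.1), let

(2.10)    $T_n = \sum_{k=1}^{n} X_{k-1}\tilde U_k$,

where $\tilde U_k = \psi(U_k)$, while $\psi$ is a measurable function such that $E(\tilde U^2) < \infty$. Then

(2.11)    $\lim n^{-1}E(T_n^2) = \sigma_X^2\{E(\tilde U^2) + 2(E\tilde U)^2\beta/(1 - \beta)\}$.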


Proor. See Section 3. 




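APPLICATIONS OF LEMMA 1. (a) If $\tilde U = \int I(U \le t) f(t)\,dH(t)$, then, by (2.9),

    $\mathrm{Var}(\sqrt{n}\,\hat\beta) = \sigma_X^{-2}\bigl(\int f^2\,dH\bigr)^{-2}\{E(\tilde U^2) + 2E^2(\tilde U)\beta/(1 - \beta)\} + o(1)$
                  $= \sigma^{-2}\bigl(\int f^2\,dH\bigr)^{-2}\{(1 - \beta^2)E(\tilde U^2) + 2\beta(1 + \beta)E^2(\tilde U)\} + o(1)$,

since $(1 - \beta^2)\sigma_X^2 = \sigma^2$. This gives the asymptotic variance of $\hat\beta$ in Theorem 1.

(b) Let $\tilde U = \int [I(U \le t) - I(-U < t)] f(t)\,dH(t)$. This yields (3.15) of Koul
(1985).

(c) If $\tilde U = \int [I(U \le t) - I(-U < t)][f(t) + f(-t)]\,dH(t)$, then Lemma 1
yields $\nu^2$ of Remark 5, Section 3 of Koul (1985).

(d) If $\tilde U = U$, then $E(\tilde U) = 0$, $E(\tilde U^2) = \sigma^2$, and $\tau^2 = \sigma_X^2\sigma^2 = \sigma^4/(1 - \beta^2)$
[see Anderson (1959), Theorem 4.1].

(e) If $\tilde U \equiv 1$, then $E(\sqrt{n}\bar X)^2 = n^{-1}E(T_n^2) \to \sigma_X^2(1 + 2\beta/(1 - \beta)) =
\sigma^2(1 - \beta)^{-2}$ [see, e.g., Theorem 6.3.3 of Fuller (1976)].

(f) If $\tilde U = I[U \le t]$, then $E(\tilde U^2) = E(\tilde U) = F(t)$, and

    $n^{-1}ES^2(t, \beta) \to \sigma_X^2(1 - \beta)^{-1}F(t)(1 - \beta + 2\beta F(t))$.

Note $\int F(t)(1 - \beta + 2\beta F(t))\,dt = \infty$ for any $-1 < \beta < 1$, since $F(t) \to 1$ as
$t \to \infty$. Hence we confine ourselves to finite measure $H$. See the proof of
Lemma 3.1.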
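Applications (d) and (e) can be checked with a few lines of arithmetic. The sketch below is an added illustration with hypothetical function names: for $\tilde U \equiv 1$ it computes $n^{-1}E(T_n^2)$ exactly from the AR(1) autocovariances $\sigma_X^2\beta^{|h|}$ and compares it with the limit in (2.11).

```python
def lemma1_limit(beta, sigma2, mean_u, second_moment_u):
    """Right-hand side of (2.11): sigma_X^2 {E(U~^2) + 2 (E U~)^2 beta / (1 - beta)}."""
    var_x = sigma2 / (1.0 - beta ** 2)
    return var_x * (second_moment_u + 2.0 * mean_u ** 2 * beta / (1.0 - beta))


def scaled_second_moment_of_sum(beta, sigma2, n):
    """n^{-1} E(T_n^2) when U~ = 1, i.e. T_n = X_0 + ... + X_{n-1}."""
    var_x = sigma2 / (1.0 - beta ** 2)
    # E(T_n^2) = sigma_X^2 * [n + 2 * sum_{h=1}^{n-1} (n - h) * beta^h]
    total = n + 2.0 * sum((n - h) * beta ** h for h in range(1, n))
    return var_x * total / n
```

With $\beta = 0.5$ and $\sigma^2 = 1$ the limit is $\sigma^2(1 - \beta)^{-2} = 4$, and the finite-$n$ value approaches it at rate $1/n$.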


The following Theorem 2 will be used to establish the asymptotic normality of
$\hat\beta$ as well as that of some other commonly used estimators for $\beta$. Note also
that the conclusion of Theorem 2 holds for all stationary ARMA($p$, $q$) processes
and for any $\tilde X_{k-1}$, where $\tilde X_{k-1}$ is a measurable function of $X_{k-1}$ such that
$E(\tilde X_{k-1}^2) < \infty$. For the sake of simplicity, we present the result and proof for the
AR(1) model only.


THEOREM 2. Let $T_n = \sum_{k=1}^{n} X_{k-1}\tilde U_k$, as defined in (2.10), and let

(2.12)    $\tau_n^2 = n^{-1}ET_n^2$,  $\tau^2 = \lim_{n\to\infty} \tau_n^2$.

Then under model (1.1), $T_n/(\sqrt{n}\,\tau_n)$ converges in distribution to the standard
normal law.


REMARKS. (a) If $\tilde U_k = \int I(U_k \le t) f(t)\,dH(t)$, then Theorem 2 yields the
asymptotic normality of $\hat\beta$ and $\hat\Delta$; see (2.8) and (2.9).

(b) For any $t$, let $\tilde U_k = I(U_k \le t)$. Then $T_n = S(t, \beta)$ in (1.4).

(c) If $\tilde U = U$, then Theorem 2 yields the asymptotic normality of the
least-squares estimator [see Theorem 4.1, Anderson (1959)]. Note also that in the
least-squares case, $\{T_n\}$ is a martingale, while in our case it is unfortunately not.

(d) By a discussion analogous to that of (e), (b), and (c) in the applications of
Lemma 1, we conclude the asymptotic normality of the sample average [see, e.g.,
Theorem 6.3.3 of Fuller (1976)] and of Koul's estimators for symmetric and
nonsymmetric $F$ [see (3.5) and (3.1) of Koul (1985)].




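To prove Theorem 2, we utilize the following convenient central limit theorem
for stationary processes. Note that, in the following Lemma 2, $T$ is an ergodic
one-to-one measure-preserving transformation on the $\sigma$-field, $\mathcal{F}$, generated by
$Y_k$, $-\infty < k < \infty$.


LEMMA 2 [Heyde (1974)]. Let $\{Y_k\}$ denote a stationary and ergodic sequence
with $EY_0 = 0$, $EY_0^2 < \infty$, and put $S_n = \sum_{k=1}^{n} Y_k$. Suppose that $\mathcal{M}_0$ is a sub-$\sigma$-field
of $\mathcal{F}$ and $\mathcal{M}_0 \subset T^{-1}(\mathcal{M}_0)$, and put $\mathcal{M}_k = T^{-k}(\mathcal{M}_0)$ and $y_k = E(Y_k|\mathcal{M}_0) -
E(Y_k|\mathcal{M}_{-1})$, $-\infty < k < \infty$. If $\sum_{k=-\infty}^{\infty} y_k = Z \in L^2$, $EZ^2 = \sigma_Z^2 > 0$, and $n^{-1}ES_n^2 \to \sigma_Z^2$
as $n \to \infty$, then $S_n/(\sigma_Z\sqrt{n})$ converges in distribution to the standard normal law.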


PROOF OF THEOREM 2. Denote

(2.13)    $Y_k = X_{k-1}\tilde U_k$.

Then, by (1.5), $Y_k$ is a measurable function of $\{U_{k-j}: j = 0, 1, 2, \ldots\}$. Hence by
the same arguments employed in Proposition 6.6, 6.31, and Corollary 6.33 of
Breiman (1968), both $\{X_k\}$ and $\{Y_k\}$ are stationary and ergodic. The fact that
$\{X_k\}$ is ergodic is also covered by Theorem 3 of Hannan (1970), page 204. Now let

(2.14)    $\mathcal{M}_k = \sigma\{U_k, U_{k-1}, U_{k-2}, \ldots\}$

and

(2.15)    $y_k = E(Y_k|\mathcal{M}_0) - E(Y_k|\mathcal{M}_{-1})$.

Note that

(2.16)    $y_k = Y_k - Y_k = 0$ for all $k \le -1$.

If $k = 0$, then

(2.17)    $y_0 = Y_0 - X_{-1}E(\tilde U)$.

For the case $k \ge 1$, we have

    $E(Y_k|\mathcal{M}_0) = E(\tilde U)(\beta^{k-1}U_0 + \beta^{k}U_{-1} + \beta^{k+1}U_{-2} + \cdots)$,
    $E(Y_k|\mathcal{M}_{-1}) = E(\tilde U)(\beta^{k}U_{-1} + \beta^{k+1}U_{-2} + \cdots)$,

and

(2.18)    $y_k = E(\tilde U)\beta^{k-1}U_0$ for all $k \ge 1$.

Hence

(2.19)    $Z = \sum y_k = X_{-1}(\tilde U_0 - E\tilde U) + U_0E(\tilde U)/(1 - \beta)$.

Note that the random variables $X_{-1}(\tilde U_0 - E\tilde U)$ and $U_0$ are uncorrelated.
Therefore

    $EZ^2 = \sigma_X^2\mathrm{Var}(\tilde U) + (E\tilde U)^2\sigma^2/(1 - \beta)^2$
         $= \sigma_X^2\{E(\tilde U^2) + 2(E\tilde U)^2\beta/(1 - \beta)\} = \tau^2$.

Theorem 2 now follows by Lemmas 1 and 2. □
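The last equality in the computation of $EZ^2$ uses only $\sigma^2 = (1 - \beta^2)\sigma_X^2$ from (2.3). The intermediate algebra, written out here as an added worked step, is:

```latex
\begin{aligned}
\sigma_X^2 \operatorname{Var}(\tilde U) + \frac{(E\tilde U)^2 \sigma^2}{(1-\beta)^2}
  &= \sigma_X^2 \Bigl\{ E(\tilde U^2) - (E\tilde U)^2
       + (E\tilde U)^2 \, \frac{1-\beta^2}{(1-\beta)^2} \Bigr\} \\
  &= \sigma_X^2 \Bigl\{ E(\tilde U^2)
       + (E\tilde U)^2 \Bigl( \frac{1+\beta}{1-\beta} - 1 \Bigr) \Bigr\} \\
  &= \sigma_X^2 \Bigl\{ E(\tilde U^2) + \frac{2\beta\,(E\tilde U)^2}{1-\beta} \Bigr\}.
\end{aligned}
```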




Note that Theorem 1 follows by Proposition 2, Lemma 1, and Theorem 2.
Note that the asymptotic variance of $\hat\beta$ depends on $H$. It is natural to pursue the
optimality of $H$. For this, we first define the following functionals:

(2.20)    $J(h) = \{(1 - \beta^2)A(h) + 2\beta(1 + \beta)B(h)\}/D(h)$,

where $h$ is in $L^2(R)$ and

(2.21)    $A(h) = \int\!\!\int F(t \wedge s) f(t) f(s) h(t) h(s)\,dt\,ds$,

(2.22)    $B(h) = \bigl(\int F(t) f(t) h(t)\,dt\bigr)^2$,

(2.23)    $D(h) = \bigl(\int f^2(t) h(t)\,dt\bigr)^2$.


THEOREM 3. Assume $f$ is bounded. Then $J(h)$ is continuous at $h \ne 0$ a.s. in
$L^2$-norm. Furthermore, there exists a nonzero $h$ in $L^2(R)$ to minimize $J(h)$.


PROOF. Since $f$ is bounded, Hölder's inequality implies that $D^{1/2}$ is a
bounded functional on $L^2(R)$. Hence $D(h)$ is continuous in $h$ [Royden (1968)]. A
similar argument applies for $B(h)$. Next note that for any given $h_0 \in L^2(R)$, we
have

    $|A(h) - A(h_0)| \le \int\!\!\int f(x) f(y)\,|h(y) - h_0(y)|\,|h(x)|\,dx\,dy$
                     $+ \int\!\!\int f(x) f(y)\,|h_0(y)|\,|h(x) - h_0(x)|\,dx\,dy$
                     $\le b_1(\|h\| + \|h_0\|)\,\|h - h_0\| \to 0$

as $h \to h_0$ in $L^2$-norm. Hence $J(h)$ is continuous at $h$ where $h \ne 0$ a.s. in
$L^2$-norm. Next note that $J(ah) = J(h)$, for all $a \ne 0$ and for all $\|h\| \ne 0$. But
$\{h: \|h\| = 1\}$ is a compact set in $L^2(R)$. Hence the continuity of $J$ implies the
existence of $h$ which minimizes $J$. This completes the proof of Theorem 3. □


Note that $L^2(R)$ is a separable Hilbert space. By Theorem 3, we can now
apply Theorem 40.1 of Gelfand and Fomin (1968) to minimize $J(h)$. Note that, in
view of the proof of the quoted theorem, it is not necessary for $J(h)$ to be
continuous everywhere. The details of this approach are rather involved and will
be reported elsewhere.

Finally, we would like to point out that another way to attain the lower
bound in (1.3) in large-sample estimation is possibly to use a
general adaptive procedure [see, e.g., Fabian and Hannan (1982)]. The
disadvantage of this method is that it may require an extremely large sample to
construct a nonparametric estimator for $\beta$. This is fine for earth-sciences data
[see, e.g., Tukey (1978)], but it causes big problems for business or social-sciences
data. Because this kind of data is usually subject to heavy outside influences, it is
difficult to find a good model for an extremely long series. This is one of the
reasons that more direct approaches, such as the minimum distance method, are
highly desirable.


3. Proofs of Proposition 1, Proposition 2, and Lemma 1. We split the
proof of Proposition 1 into the following lemmas. This approach is close to the
one used by Koul and DeWet (1983) for the usual regression model.


LEMMA 3.1. If $f$ is bounded and $H$ is finite, then

(3.1)    $\limsup_n E \sup_{\sqrt{n}|\Delta - \beta| \le a} n^{-1}\int \{S(t, \beta) + n(\Delta - \beta)\sigma_X^2 f(t)\}^2\,dH(t) < \infty$.


PROOF. By using the fact that $H$ is a finite measure and the inequality
$(a + b)^2 \le 2(a^2 + b^2)$, the left-hand side of (3.1) is less than or equal to

    $\limsup_n 2n^{-1}E\int S^2(t, \beta)\,dH(t) + \limsup_n \sup_{\sqrt{n}|\Delta - \beta| \le a} 2n(\Delta - \beta)^2\sigma_X^4\int f^2\,dH$.

The proof of the lemma is then completed by the assumptions and application (f)
of Lemma 1, Section 2, which also rules out taking $H$ to be the
Lebesgue measure over the whole real line. □


LEMMA 3.2. Assume (A.1), (A.2), $E(U^4) < \infty$, and $H$ is finite. Then

(3.2)    $\sup_{\sqrt{n}|\Delta - \beta| \le a} \int n^{-1}\{S_1(t, \Delta) - n(\Delta - \beta)\sigma_X^2 f(t)\}^2\,dH(t) = o(1)$,

where

(3.3)    $S_1(t, \Delta) = ES(t, \Delta) - ES(t, \beta) = n\int xF(t + (\Delta - \beta)x)\,dG(x)$.


PROOF. By the Schwarz inequality, we have, for all $t$,

(3.4)    $n^{-1}\{S_1(t, \Delta) - n(\Delta - \beta)\sigma_X^2 f(t)\}^2 \le n(\Delta - \beta)^2\int \{f(t + \xi_n(x)) - f(t)\}^2 x^4\,dG(x) = O(1)$,

since $n(\Delta - \beta)^2 \le a^2$, $F$ is Lipschitz, $f$ is bounded, and $E(X^4) < \infty$. (3.4) also
implies that lhs (3.4) $= o(1)$. In detail, note that for any $\varepsilon > 0$, there exists $b > 0$
such that

    lhs (3.4) $\le \varepsilon + a^2\int_{-b}^{b} \{f(t + \xi_n(x)) - f(t)\}^2 x^4\,dG(x)$,

where $|\xi_n(x)| \le |\Delta x - \beta x| \le n^{-1/2}ab \to 0$ as $n \to \infty$. Note $\varepsilon$ does not depend on
$n$. Therefore,

    lhs (3.4) $\le \varepsilon + a^2E(X^4)\sup_{x \in [-b, b]} \{f(t + \xi_n(x)) - f(t)\}^2 = \varepsilon + o(1)$

by (2.7), which is true under (A.1) and (A.2). We further have

    $\sup_t n^{1/2}(\Delta - \beta)\sigma_X^2 f(t) = O(1) = \sup_t n^{-1/2}S_1(t, \Delta)$.

Hence the dominated convergence theorem implies

(3.5)    $\int n^{-1}\{S_1(t, \Delta) - n(\Delta - \beta)\sigma_X^2 f(t)\}^2\,dH(t) = o(1)$.

This in turn implies (3.2) by using the monotonicity of $S_1(t, \Delta)$ in $\Delta$ [see, e.g., the
proof of Theorem 5.1 of Koul and DeWet (1983)]. □

LEMMA 3.3. Assume (A.1), (A.2), E|JU|® < œ, and H is finite. Then 


(3.6) sup froe{s(s, A) — S(t, B) — S(t, A)}? dH(t) = o(1). 
yn|4—-Blsa 
Proor. In view of the previous proof, it suffices to verify 
(3.7) nE{S(t, A) — S(t, B) — S(t, A)}* = o(1). 
Note that, by Lemma 3.2, 
(3.8) Ihs (3.7) = n™E{S(t, A) — S(t, 8)}? — n(A — B)*of f ?(t) + o(1). 
Denote 
(3.9) Dy = Xp—\(I[U, — (A - B)Xn-1 < t} — I[U, < t). 


By (1.5) and |8| < 1, D, is a measurable function of {U;; — o0 <j < k}. Hence 
{D,} is a stationary process and the first term of rhs (3.8) reduces to 


(3.10) ED? + 2n~\{(n— 1)ED,D, + (n — 2)ED,D, + --- + ED,D,}. 
Since F is Lipschitz, the first term of (3.10) reduces to 


(3.11) ED? < O(1)jA — B\|E|X3| = o(1). 
For the second term of (3.10), note first 
ED,D, 


= EX X,_,{I[U, — (A — 8)Xo < t] — I[U, < t] - (A ~ B)Xof(t)} 
x {7[U, — (A — B)X,_, < t] — I[U, < t]} 
(3.12) + (A — B)f(t)EXgX,-, 
x {I[U, vs (A F B)Xk-i <S t] ~ I[U, <s t] zi (A z B)X,_1f(t)} 
+ (A ~ BY f*(t)E(X¢X21) 
= B, + B, + B,, 




say. We claim 


(3.13) $nB_1 + nB_2 = o(1)$.
To prove the above claim, in view of the proof of Lemma 3.2, it suffices to verify 
(3.14) $nB_1 + nB_2 = O(1)$.


To this end, note F is Lipschitz and hence 

(Xp Tt < Up < t+ (A~ B)X, a] TECA = B)X_ > OF 1) 
< Xf_,)4 — B)O(1), 

where #,_, is the o-algebra generated by {Xp,U,,...,U,_,}, and 

E(U,I[t < U, < t+ (A—B)Xo]I[(A — B)X = 0] |%) 


E 
(3.15) 


(3.16) 
< O(1)(A — B)XoI[(A — B)X, = 0]. 


By (3.15), EX‘ < oo, and the boundedness of f, we then have nB, = O(1). Using 
this, we further have 


f = 2 
(3.17) ni|By| < n: O(1)(A — B)EX) Xp, 


x (I[U, — (A — B)X, < t] — 1[U, < t]} + 002). 


Let Ip, = I[U, — (A — 8)Xo < t] — I[U, < t], then the rhs (3.17) is bounded 
by 


n° O(1)(A nae B)E( X11) 
+n - O(1)(A — B)E{XoIp,(B* X, + B*-*U,)"} + O(1) 


< O(1), 
where the inequality holds by (3.15), (3.16), and E|X°| < œ. This completes the 
proof of (3.14) and therefore (3.13) holds. The proof of the lemma is then 
completed by using (3.10)-(3.13) and by the facts that E(XjX}_,) = 
B?*-2E( X*) + of(1 — B?*) and that 


$n^{-1}\sum_{k=1}^{n-1}(n-k)\beta^k = o(1) + \beta/(1-\beta)$ for all $|\beta| < 1$. □
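
The limit used in the last step can be checked numerically. The sketch below is only an illustration: the function name `cesaro_geometric_sum` is an assumption made here and does not come from the paper. It evaluates $n^{-1}\sum_{k=1}^{n-1}(n-k)\beta^k$ for growing $n$ and prints it next to $\beta/(1-\beta)$.

```python
def cesaro_geometric_sum(beta, n):
    """Return n^{-1} * sum_{k=1}^{n-1} (n - k) * beta**k (hypothetical helper)."""
    return sum((n - k) * beta ** k for k in range(1, n)) / n


if __name__ == "__main__":
    beta = 0.5
    limit = beta / (1.0 - beta)
    # The gap to the limit shrinks at rate 1/n, which is the o(1) term in the text.
    for n in (10, 100, 1000, 10000):
        print(n, cesaro_geometric_sum(beta, n), limit)
```

The difference from the limit is $n^{-1}\sum k\beta^k$ plus a geometrically small tail, so it vanishes for every fixed $|\beta| < 1$ but more slowly as $|\beta|$ approaches 1.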


PROOF OF PROPOSITION 1. To simplify the notation, we denote
$\int g^2(t, \Delta)\,dH(t) = \|g\|_H^2$ for any measurable function $g$. Then, by Minkowski's
inequality,

2 
QCA) — QCA) = Ihi — Sp + 2(4 - B) IL, 
< {IS — Sp — Silla + |S, — 2(4 - B) oz lla) 
x {IS — Sp — Silla + |S - (A — BY ORF |y 


+2||S, + n(A — B)ozf l 




where the conditions for Minkowski’s inequality are assured by the above lemmas 
which further imply (2.1). This completes the proof of Proposition 1. O 


PROOF OF PROPOSITION 2. In view of (2.8), it suffices to prove 


(3.18) $\sqrt{n}(\hat\beta - \hat\Delta) \to 0$ in probability.
Note the measurability of $\hat\beta$ can be achieved by giving a definite rule of selection
for $\hat\beta$. But the procedure is long and uninteresting and will be omitted without
further discussion. The following proof of (3.18) is close to the one in Williamson
(1982). Note first by (2.2) and (2.8) we have
@,(A) = INSpllie + notl Fiz {(A — B)” - 2(A — B)(A - B)}, 

Q (Ê) T Q (Â) z n*o$il f (Ê = Ay’, 
and 
(3.19) noxll fix(B ~ A)’ = n-(@,(B) - @,(A)). 


By Proposition 1 and Chebyshev’s inequality, for any fixed a there exists e > 0 
such that 


(3.20) P| smp nola) ~ @(4)| se} > 1. 
yn \A—Bisa 

This motivates us to consider the set 

(321) Aa) = {vÂ -psa js Q(A)< | inf @(A)}. 

It is reasonable to guess, for any small e > 0, 

(3.22)  P{A,(a)}>1-—e forsome a and for sufficiently large n. 


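This guess will be proved formally in Lemma 3.4 below. In view of (3.22), (3.20),
and (3.19), to prove (3.18) it suffices to prove


(3.23) $\sup_{\sqrt{n}|\Delta - \beta| \le a} n^{-1}|Q(\Delta) - \bar Q(\Delta)| \le \varepsilon$


implies $n^{-1}\{\bar Q(\hat\beta) - \bar Q(\hat\Delta)\} \le 2\varepsilon$ on $A_n(a)$. Indeed, by the assumption of (3.23),
we have, on $A_n(a)$,


(3.24) $n^{-1}|Q(\hat\beta) - \bar Q(\hat\beta)| \le \varepsilon$ and $n^{-1}|Q(\hat\Delta) - \bar Q(\hat\Delta)| \le \varepsilon$;
hence
$n^{-1}\bar Q(\hat\beta) - n^{-1}\bar Q(\hat\Delta) \le \{n^{-1}Q(\hat\beta) + \varepsilon\} - n^{-1}\bar Q(\hat\Delta)$
$\le \{n^{-1}Q(\hat\Delta) + \varepsilon\} - n^{-1}\bar Q(\hat\Delta) \le \varepsilon + \varepsilon = 2\varepsilon$
on $A_n(a)$, where the first and last inequalities hold by (3.24) and the second
inequality holds by (1.10). This completes the proof of (3.23) and Proposition 2. □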


The next lemma will make up the gap in the above proof. See (3.22). 




LEMMA 3.4. Under the assumptions of Proposition 1, we have 
(3.25) $P\{A_n(a)\} > 1 - \varepsilon$ for some $a$ and for sufficiently large $n$,
where $A_n(a)$ is defined in (3.21).


PROOF. Define
(3.26) $L(\Delta) = \int S(t, \Delta)\sqrt{f(t)}\,dH(t)$.


We will use the asymptotic linearity of L(A) as leverage to prove the lemma. By 
the asymptotic linearity of L(A) we mean 


(3.27) sup nV? 
yajA— Bisa 


in probability. Indeed, 





L(A) ~ L(B) ~ (A ~ B)no f°? azz| > 0 


lbs (3.27) < sup n~"? fist A) — S(t, B) — S(t, A) FŒ) dH(t) 


+ supn~¥? [|S(t, A) — (A — B)nozf(t)|VF(t) dH(t) 
~0 
in probability by (3.6), (3.2), and the Schwarz inequality. Next, by the fact that 
$n^{-1}L^2(\beta) \le n^{-1}Q(\beta)\int f\,dH = O_p(1)$,


we extend (3.27) to 


(3.28) sup 


n-'Z%(A)—{n7L(B) + (A - B)n?o2 ff saan) | 550. 
yn|A- Bisa 





Note also, by the Remark (a) of Theorem 2 and Lemma 3.1, we have 
For all e > 0 there exist d > 0 and ny > 0, such that for all n = no, 
e e 
(829) P{YnlA- A] < d} > 1- Ž and P(n-9(B) ffan < a) 21-5. 
Obviously, 
(3.30) _ n-°|L(B)| < Vd on the set (nole) fian < a). 


Hence, on the set {n~1Q(8){fdH < d}, by choosing a > 2Vd {ff 32 dH}! we 




have 
amn (PEC) + (A - 8)r? fi man) 
(8.31) > {—n-12(8)| + aff aan) 


> (-Vd + /d}* = d> nQ(B) ffdH. 
Consequently, by (3.30), (3.29), and (3.28) we conclude 


(3.32) P{n"9(8) ffan < d and Q(B) ffdH < mO) 21-5 


for large n. Finally, note 
inf Q(A)/fdH=> inf L(A 
(3.33) jà- gj>a ( Ji Aja -p> a (4) 
= _inf LĽ(A)= min L?(A), 
yn|A—Biza yn|A— Blea 


where the last equality holds because L(A) is nondecreasing in A. (3.33), (3.32), 
and (3.29) together yield (3.25). □


PROOF OF LEMMA 1. Note that
(3.34) n E(T?) = of E(U?) + 2n! X E(X, X, Ŭ, )E(Ŭ). 


J<k 

For all j > 0 let 

(3.35) h(j) = E(X,X,0,). 
Then, by induction 

(3.36) h(j) = of E(0)p?. 


Since {X,_,X,_U,: 1 <j < k < n} are identically distributed, the second term 
of the rhs (3.34) equals 


2E(O)n-4{(n — 1)A(1) + (n — 2)A(2) + «+ +A(n - 1)} 


= 2E(O)o3n F. (n—f)B! = 2E*(C1)028/(1 — B) + o(1). 


j=l 
This completes the proof of the lemma. □


Acknowledgments. The author would like to thank the Associate Editor 
and the referees for their careful reading of the original manuscript and for their 
suggestions that led to substantial improvements in the presentation and the 
final result of Theorem 3. 




REFERENCES 


AKRITAS, M. G. and JOHNSON, R. A. (1980). Efficiencies of tests and estimators for pth-order
autoregressive processes when the error distribution is nonnormal. Technical Report,
Dept. of Statist., Univ. of Wisconsin, Madison.

ANDERSON, T. W. (1959). On asymptotic distributions of estimates of parameters of stochastic
difference equations. Ann. Math. Statist. 30 676-687.

BOX, G. E. P. and JENKINS, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day,
San Francisco.

BREIMAN, L. (1968). Probability. Addison-Wesley, Reading, Mass.

FABIAN, V. and HANNAN, J. (1982). On estimation and adaptive estimation for locally asymptotically
normal families. Z. Wahrsch. verw. Gebiete 59 459-478.

FULLER, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York.

GELFAND, I. M. and FOMIN, S. V. (1968). Calculus of Variations. Prentice Hall, New Jersey.

HÁJEK, J. and ŠIDÁK, Z. (1967). Theory of Rank Tests. Academic, New York.

HANNAN, E. J. (1970). Multiple Time Series. Wiley, New York.

HEYDE, C. C. (1974). On the central limit theorem for stationary processes. Z. Wahrsch. verw.
Gebiete 30 315-320.

KOUL, H. L. (1985). Minimum distance estimation and goodness-of-fit tests in first-order
autoregression. Ann. Statist. 14 1194-1213.

KOUL, H. L. and DEWET, T. (1983). Minimum distance estimation in a linear regression model.
Ann. Statist. 11 921-932.

LAI, T. L. and SIEGMUND, D. (1983). Fixed accuracy estimation of an autoregressive parameter.
Ann. Statist. 11 478-485.

MILLAR, P. W. (1982). Optimal estimation of a general regression function. Ann. Statist. 10 717-740.

ROYDEN, H. L. (1968). Real Analysis. Macmillan, New York.

TUKEY, J. W. (1978). Can we predict where "time series" should go next? In Directions in Time
Series (D. R. Brillinger and G. C. Tiao, eds.) 1-31. IMS, Hayward, Calif.

VEITCH, J. G. (1983). Minimum distance procedures in stationary time series. Ph.D. thesis, Univ. of
California, Berkeley.

WILLIAMSON, M. A. (1982). Cramér-von Mises-type estimation of the regression parameter: The
rank analogue. J. Multivariate Anal. 12 248-255.


DEPARTMENT OF MATHEMATICS AND STATISTICS 
TRENTON STATE COLLEGE 
EWING TOWNSHIP, NEW JERSEY 08625 


The Annals of Statistics
1986, Vol. 14, No. 3, 1194-1213 


MINIMUM DISTANCE ESTIMATION AND GOODNESS-OF-FIT 
TESTS IN FIRST-ORDER AUTOREGRESSION¹


By Hira L. KOUL 


Michigan State University 


This paper gives a class of minimum $L_2$-distance estimators of the
autoregression parameter in the first-order autoregression model when the
errors have an unknown symmetric distribution. Within the class an asymptotically
efficient estimator is exhibited. The asymptotic efficiency of this
estimator relative to the least-squares estimator is the same as that of a 
certain signed rank estimator relative to the sample mean in the one sample 
location model. The paper also discusses goodness-of-fit tests for testing for 
symmetry and for a specified error distribution. 


1. Introduction. Let $\{\varepsilon_i,\ i = 0, \pm1, \pm2, \ldots\}$ be independent random variables
(r.v.'s) that are identically distributed according to a distribution function
(d.f.) $F$. Let $\{X_i\}$ be an observable process such that, for $|\rho| < 1$, $X_{i-1}$ is
independent of $\varepsilon_i$ and
(1) $X_i = \rho X_{i-1} + \varepsilon_i$, $i = 0, \pm1, \pm2, \ldots$.
The above process $\{X_i\}$ is called the first-order autoregressive (AR(1)) process.
This paper considers the minimum distance estimation of $\rho$ based on the
observations $\{X_0, X_1, \ldots, X_n\}$ when $F$ is not necessarily known. Also considered
are tests of symmetry of $F$ and tests of the goodness-of-fit for $F$.

Of course the classical least-squares estimator $\tilde\rho = \sum_{i=1}^n X_{i-1}X_i / \sum_{i=1}^n X_{i-1}^2$ is a minimum
distance estimator. But this estimator is highly inefficient for non-Gaussian
errors, including contaminated Gaussian errors [Fox (1972), Denby and Martin 
(1979), and Martin (1981)]. The minimum distance (m.d.) estimation methods 
that lead to efficient and robust estimators in models involving independent 
observations are those promoted by Wolfowitz (1957). These methods are further 
studied by Beran (1977, 1978), Williamson (1979), Boos (1981, 1982), Parr and 
Schucany (1980), Parr and DeWet (1981), Millar (1981, 1982), Koul (1980, 1985), 
and Koul and DeWet (1983), among others. See also Parr (1981) for a detailed 
bibliography prior to 1981. 
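
For readers who want to experiment with model (1), the following minimal sketch simulates an AR(1) path and computes the classical estimator quoted above. Everything in it is an assumption made for illustration: the names `simulate_ar1` and `least_squares_rho`, the standard normal errors, and the burn-in length are not taken from the paper.

```python
import random


def simulate_ar1(rho, n, seed=0, burn_in=200):
    """Simulate X_0, ..., X_n from X_i = rho * X_{i-1} + eps_i, eps_i ~ N(0, 1)."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(burn_in):            # warm-up so that X_0 is close to stationarity
        x = rho * x + rng.gauss(0.0, 1.0)
    path = [x]
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def least_squares_rho(x):
    """Classical estimator: sum_i X_{i-1} X_i / sum_i X_{i-1}^2, i = 1, ..., n."""
    num = sum(x[i - 1] * x[i] for i in range(1, len(x)))
    den = sum(x[i - 1] ** 2 for i in range(1, len(x)))
    return num / den


if __name__ == "__main__":
    path = simulate_ar1(rho=0.5, n=2000, seed=1)
    print(least_squares_rho(path))
```

Replacing `rng.gauss` by a heavy-tailed or contaminated error generator is one way to observe the loss of efficiency mentioned in the text.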

The most common distance statistics used in the literature are the Cramér— 
von Mises type statistics. Some of the reasons for this are that the corresponding 
m.d. estimators are consistent, asymptotically normal, qualitatively robust 
against certain contaminated errors [Millar (1981, 1982) and Koul (1985)] and 
locally asymptotically minimax (Millar, op. cit.). In view of these properties of
practical import it is highly desirable to seek m.d. estimators of $\rho$ in (1), using a
suitable Cramér-von Mises type statistic.


Received March 1984; revised October 1985. 

¹Research supported by NSF Grant 82-01291.

AMS 1980 subject classifications. Primary 62G05; secondary 62G20, 62G10.

Key words and phrases. Weighted empirical residual process, stationary, ergodic, influence curve. 




To motivate the definition of m.d. estimators of $\rho$, let us recall a result from
Koul and DeWet (K-D) (op. cit.). Consider a simple linear regression model
through the origin: $Y_i = c_i\beta + e_i$, $1 \le i \le n$, where $\{e_i\}$ are i.i.d. $F$, $F$ a known
d.f., $\{c_i\}$ known constants. It was shown in K-D that a $t$ minimizing


(2) $\int\Bigl[\sum_{i=1}^n c_i\{I(Y_i \le y + tc_i) - F(y)\}\Bigr]^2\,dH(y)$,

with $H$ as in (6) below, is asymptotically an optimal estimator of $\beta$ among a class
of estimators including the one based on an $L_2$-distance between the ordinary
residual empirical process and $F$.

It is natural to consider estimation of $\rho$ in the autoregression model (1) by
formally replacing $c_i$ in (2) by $X_{i-1}$. Thus, if we know the error d.f. $F$ in (1), we
may define an estimator of $\rho$ as a $t$ that minimizes


(3) $\int\Bigl[\sum_{i=1}^n X_{i-1}\{I(X_i \le y + tX_{i-1}) - F(y)\}\Bigr]^2\,dH(y)$.

But in practice $F$ is rarely known and we must now eliminate the centering $F$ in
(3). For that purpose we shall assume that


(4) $F$ is symmetric around 0.

Then a way to eliminate $F$ in (3) is to replace it by the indicator $I(-X_i < y - tX_{i-1})$
because at $t = \rho$, this indicator also estimates $F(y)$. We are thus motivated
to introduce


(5) $S(y, t) = n^{-1/2}\sum_{i=1}^n X_{i-1}\{I(X_i \le y + tX_{i-1}) - I(-X_i < y - tX_{i-1})\}$, $y$, $t$ in $\mathbb{R}$,

$M(t) = \int S^2(y, t)\,dH(y)$, $t$ in $\mathbb{R}$,


where 


(6) $H$ is a known nondecreasing right continuous function from $\mathbb{R}$ to
$\mathbb{R}$, inducing a $\sigma$-finite measure on $(\mathbb{R}, \mathcal{B})$, the Borel line.


Now define $\hat\rho$ by the relation
(7) $\inf_t M(t) = M(\hat\rho)$.


The dependence of $\hat\rho$ on $H$ will be exhibited only occasionally. The proposed class
of m.d. estimators is $\{\hat\rho(H),\ H \text{ varies}\}$. The optimality with respect to $H$ is
discussed in Remark 3.2.

This paper studies some finite and large sample properties of the class of
estimators $\{\hat\rho(H)\}$. Section 2 discusses some finite sample properties including
the computational and scale invariance aspects. In general the class of estimators
$\{\hat\rho(H)\}$ is not scale invariant. These estimators can be made scale invariant by
using $\{s^{-1}X_i,\ 0 \le i \le n\}$ in place of $\{X_i,\ 0 \le i \le n\}$ in (7), where $s$ is a suitable
scale invariant estimator of a scale parameter of $F$. See (2.7) below for an example



of $s$. Section 2 also contains extensions of $\hat\rho$ to the AR(1) model with location
parameter as well, and to the AR(2) model.

In Section 3, a class of estimators $\{\hat\rho_h\}$ is introduced and its asymptotic
normality is proved. The estimator $\hat\rho$ is a member of this class. Theorem 3.3
asserts an asymptotic optimality of $\hat\rho$ among the class of estimators $\{\hat\rho_h\}$. This
result is similar to Theorem 3.2 of K-D. In the same section, $\hat\rho$ is compared to
other estimators. The asymptotic efficiency of $\{\hat\rho(H)\}$ relative to $\tilde\rho$ is the same as
that of certain signed rank estimators relative to the sample mean in the one
sample location model; see Remark 3.3. Asymptotically, $\{\hat\rho_h(H)\}$ are like GM-estimators
of Denby and Martin (1979) corresponding to a $\psi$ that depends on $F$;
see Remark 3.7.

In Remark 3.4 an estimator of the asymptotic variance of $n^{1/2}(\hat\rho(H) - \rho)$ is
provided for $H(y) = y$. The effect of asymmetry of $F$ on $\hat\rho$ is evaluated in
Remark 3.5. It is noted that if $F$ is asymmetric but $E\varepsilon_1 = 0$ then $\hat\rho$ has no
asymptotic bias but its asymptotic variance is larger than that under symmetry.
In Remark 3.8 it is noted that the influence of contaminating $\{\varepsilon_i\}$ on $\hat\rho$ is zero at
a symmetric $F$. Remark 3.9 discusses the asymptotic distribution of the scale
invariant version of $\{\hat\rho(H)\}$. Finally, Section 3 briefly gives the asymptotic
distributions of the extensions of $\hat\rho$ introduced in Section 2.

Section 4 discusses tests of the hypothesis of symmetry $H_0$ and tests of
goodness of fit for a specified error distribution. In both situations the asymptotic 
null distributions of the proposed tests are shown to be the same as those of their 
counterparts in the one sample location model. See also Pierce (1985) for a similar 
observation with regards to tests of the normality of errors. The asymptotic null 
distribution of the Cramér—von Mises type tests based on the ordinary empirical 
for testing for a specified error distribution does not depend on the specified error 
d.f., as long as its mean is 0. Section 5 contains all the proofs. 


Notation. All limits are taken as $n \to \infty$, unless specified otherwise. By
$o_p(1)$ ($O_p(1)$) is meant a sequence of r.v.'s that tends to zero (stays bounded) in
probability. The dependence of various entities on $n$ is not exhibited, for the
sake of convenience. The real line is denoted by $\mathbb{R}$.


2. Some finite sample properties of $\hat\rho$ and extensions. For the purpose of
computation of $\hat\rho$ the following representation of $M$ is useful. Fix a $t$ in $\mathbb{R}$ and let
$Z_i = X_i - tX_{i-1}$, $i = 1, \ldots, n$. Let $c = \max\{Z_i, -Z_i;\ 1 \le i \le n\}$. Observe that in
(1.5), $S(y, t) = 0$ for all $y > c$. Now use the fact that for any reals $a$, $b$,
$2\max(a, b) = a + b + |a - b|$, and the nondecreasing nature of $H$ to conclude
that for any real $t$


(1) $M(t) = n^{-1}\sum_{i=1}^n\sum_{j=1}^n A_iA_j\bigl[\,|H(Z_i) - H(-Z_j)| - \tfrac12\{|H(Z_i) - H(Z_j)| + |H(-Z_i) - H(-Z_j)|\}\bigr]$,

where $A_i = X_{i-1}$. If
(2) $|H(a) - H(b)| = |H(-a) - H(-b)|$, $a$, $b$ in $\mathbb{R}$,




then, for $t \in \mathbb{R}$,

(3) $M(t) = n^{-1}\sum_{i=1}^n\sum_{j=1}^n X_{i-1}X_{j-1}\bigl[\,|H(X_i - tX_{i-1}) - H(-X_j + tX_{j-1})| - |H(X_i - tX_{i-1}) - H(X_j - tX_{j-1})|\,\bigr]$.


In the derivation of (1), $H$ is assumed to be continuous. If $H$ is not continuous, (1)
continues to hold with probability 1 as long as $F$ is continuous. In any case the
representations (1) and (3) make it clear that the computation of $\hat\rho$ is similar to that
of maximum likelihood estimators.
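
The representation (3) can be sketched in code for the special case $H(y) = y$, where it reads $M(t) = n^{-1}\sum_i\sum_j X_{i-1}X_{j-1}\{|Z_i + Z_j| - |Z_i - Z_j|\}$. The block below is an illustration under that assumption only. The function names, the exact piecewise integration used as a cross-check, and the grid search are choices made here, not part of the paper.

```python
def md_objective(x, t):
    """M(t) from representation (3) with H(y) = y; x holds X_0, ..., X_n."""
    n = len(x) - 1
    a = x[:-1]                                            # A_i = X_{i-1}
    z = [x[i] - t * x[i - 1] for i in range(1, n + 1)]    # Z_i = X_i - t X_{i-1}
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += a[i] * a[j] * (abs(z[i] + z[j]) - abs(z[i] - z[j]))
    return total / n


def md_objective_direct(x, t):
    """M(t) as the integral of S^2(y, t) dy, using that S is piecewise constant in y."""
    n = len(x) - 1
    a = x[:-1]
    z = [x[i] - t * x[i - 1] for i in range(1, n + 1)]
    points = sorted(set(z + [-v for v in z]))             # S can only jump at +/- Z_i
    total = 0.0
    for lo, hi in zip(points, points[1:]):
        mid = 0.5 * (lo + hi)
        s = sum(ai * ((zi <= mid) - (-zi < mid)) for ai, zi in zip(a, z))
        total += s * s * (hi - lo)
    return total / n                                      # n^{-1} comes from the n^{-1/2} in S


def md_estimate(x, grid):
    """Crude minimizer of M over a user-supplied grid of candidate values of t."""
    return min(grid, key=lambda t: md_objective(x, t))


if __name__ == "__main__":
    data = [0.3, -0.1, 0.8, 0.2, -0.5, -0.9, 0.1, 0.6]
    grid = [k / 100.0 for k in range(-99, 100)]
    print(md_estimate(data, grid))
```

The double sum costs $O(n^2)$ per evaluation. The grid search is only a stand-in for whatever one-dimensional minimizer is preferred, and it does not apply the averaging rule that the text uses to resolve ties.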

The assumption (2) is that $H$ is symmetric around 0, which is natural when
$F$ is symmetric. Useful examples are $H(y) = y$ and $H$ given by $dH =
\{F_0(1 - F_0)\}^{-1}\,dF_0$, $F_0$ a known symmetric d.f.

To overcome the difficulty due to the possible nonuniqueness, modify $\hat\rho$ as
follows. First observe that by the Cauchy-Schwarz inequality, for all $t$ in $\mathbb{R}$,


(4) $M(t) \ge \Bigl\{\int S(y, t)g^{1/2}(y)\,dH(y)\Bigr\}^2 \Big/ \int g\,dH$,
where $g$ is a nonnegative function on $\mathbb{R}$ such that

(5) $0 < \int g\,dH < \infty$.

Let

(6) $L(t) = \int S(y, t)g^{1/2}(y)\,dH(y) = \int n^{-1/2}\sum_{i=1}^n X_{i-1}\{I(X_i \le y + tX_{i-1}) - I(-X_i < y - tX_{i-1})\}\,g^{1/2}(y)\,dH(y)$.

Clearly $L(t)$ is a nondecreasing function of $t$. Therefore, by (4), $M(t)$ is bounded
below by a nonnegative function which is nonincreasing on $(-\infty, b_0)$ and
nondecreasing on $[b_0, \infty)$ for some finite $b_0$. Consequently, $\hat\rho$ may be uniquely
defined as an average of the two quantities at which $M(t)$ is minimized for the
first time and for the last time as $t$ moves from the left to the right.

Now consider the question of scale invariance of $\hat\rho$. Write $\hat\rho(\lambda)$ for $\hat\rho(H)$ when
$H(y) = y$ ($\lambda$ for the Lebesgue measure). Note that $\hat\rho(\lambda)$ is scale invariant in
the sense that $\hat\rho(\lambda)$ based on $\{bX_i,\ 0 \le i \le n\}$ is the same as the $\hat\rho(\lambda)$ based on
$\{X_i,\ 0 \le i \le n\}$ for all $b$ in $\mathbb{R}$. In general $\{\hat\rho(H)\}$ are not scale invariant in this
sense. One way to make these estimators scale invariant is to base them on
$\{s^{-1}X_i,\ 0 \le i \le n\}$, where $s = s(X) = s(X_0, X_1, \ldots, X_n)$ is a scale estimator such
that $s(bX) = |b|s(X)$ for all real $b$. We mention one such estimator:


(7) $s = \mathrm{med}\{|X_i - \hat\rho_0X_{i-1}|;\ 1 \le i \le n\}$,


where $\hat\rho_0$ is a scale invariant estimator of $\rho$, e.g. $\hat\rho(\lambda)$ or $\tilde\rho$. The effect of making
$\hat\rho(H)$ scale invariant on its asymptotic distribution is discussed in Remark 3.9.
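
A small sketch of the scale estimator (7) follows. The function name `residual_median_scale` is hypothetical, and the preliminary estimate `rho0` is passed in by the caller, who is assumed to supply a scale invariant value (for instance the classical least-squares estimate).

```python
import statistics


def residual_median_scale(x, rho0):
    """s = med{|X_i - rho0 * X_{i-1}| : 1 <= i <= n} for data x = (X_0, ..., X_n)."""
    return statistics.median(abs(x[i] - rho0 * x[i - 1]) for i in range(1, len(x)))


if __name__ == "__main__":
    x = [1.0, 3.0, 2.0, 5.0]
    s = residual_median_scale(x, 0.5)
    # Multiplying every observation by b multiplies s by |b| when rho0 is unchanged.
    s_scaled = residual_median_scale([-3.0 * v for v in x], 0.5)
    print(s, s_scaled)
```

Because `rho0` is held fixed under rescaling, the property $s(bX) = |b|s(X)$ in the text reduces to the homogeneity of the absolute value and the median.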




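Next, consider extensions of $\hat\rho$. First, consider the model where, for $\theta$ in $\mathbb{R}$,
$|\rho| < 1$,


(8) $X_i = \theta + \rho X_{i-1} + \varepsilon_i$, $i = 0, \pm1, \pm2, \ldots$,
with $\varepsilon_i$, $X_{i-1}$ as in (1.1). To define m.d. estimators of $(\theta, \rho)$, let

(9) $S_1(y; a, t) = n^{-1/2}\sum_{i=1}^n\{I(X_i \le y + a + tX_{i-1}) - I(-X_i < y - a - tX_{i-1})\}$,

$S_2(y; a, t) = n^{-1/2}\sum_{i=1}^n X_{i-1}\{I(X_i \le y + a + tX_{i-1}) - I(-X_i < y - a - tX_{i-1})\}$,

$M(a, t) = \int\{S_1^2(y; a, t) + S_2^2(y; a, t)\}\,dH(y)$, $a$, $t$, $y$ in $\mathbb{R}$.


Now define $(\hat\theta, \hat\rho)$ by the relation
(10) $\inf_{a, t} M(a, t) = M(\hat\theta, \hat\rho)$.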

Next, consider the stationary AR(2) process, where for $\rho_1$, $\rho_2$ real,
(11) $X_i = \rho_1X_{i-1} + \rho_2X_{i-2} + \varepsilon_i$, $i = 0, \pm1, \pm2, \ldots$,

with $\varepsilon_i$, $X_{i-1}$ as in (1.1). To define m.d. estimators of $(\rho_1, \rho_2)$, define for $t_1$, $t_2$
in $\mathbb{R}$,

(12) $S_j(y; t_1, t_2) = n^{-1/2}\sum_{i=1}^n X_{i-j}\{I(X_i \le y + t_1X_{i-1} + t_2X_{i-2}) - I(-X_i < y - t_1X_{i-1} - t_2X_{i-2})\}$, $j = 1, 2$.

Let

(13) $M(t_1, t_2) = \int\{S_1^2(y; t_1, t_2) + S_2^2(y; t_1, t_2)\}\,dH(y)$, $t_1$, $t_2$ in $\mathbb{R}$.
Define $(\hat\rho_1, \hat\rho_2)$ by the relation
(14) $\inf_{t_1, t_2} M(t_1, t_2) = M(\hat\rho_1, \hat\rho_2)$.


The asymptotic distributions of the estimators defined in (10) and (14) are
summarized at the end of Section 3. Extension of $\hat\rho$ to the $p$th order stationary
autoregressive series with location $\theta$ is now apparent from (8)-(14).


3. Asymptotic behavior of $\hat\rho$ and $\hat\rho_h$. This section studies the asymptotic
behavior of $\hat\rho$ under fairly general assumptions on the underlying quantities.
Actually, we first study the asymptotic distribution of a class of estimators $\{\hat\rho_h\}$,
to be defined shortly, of which $\hat\rho$ is a member. Then we deduce various asymptotic
results about $\hat\rho$ including its optimality within the class $\{\hat\rho_h\}$.




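To define $\hat\rho_h$, let $h$ be a measurable function from $\mathbb{R}$ to $\mathbb{R}$ and define, for $y$, $t$
in $\mathbb{R}$,

(1) $S_h(y, t) = n^{-1/2}\sum_{i=1}^n h(X_{i-1})\{I(X_i \le y + tX_{i-1}) - I(-X_i < y - tX_{i-1})\}$,

$M_h(t) = \int S_h^2(y, t)\,dH(y)$.

Now define $\hat\rho_h$ by the relation
(2) $\inf_t M_h(t) = M_h(\hat\rho_h)$.
Again, the dependence of $\hat\rho_h$ on $H$ is suppressed. For each $H$ we have a class of
estimators $\{\hat\rho_h;\ h \text{ varies}\}$ which reduces to $\hat\rho$ of (1.7) upon taking $h(x) \equiv x$.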
In order to obtain the results, we shall need the following assumptions: 


(A.1) $F$ has a continuous density $f$ with respect to $\lambda$, the Lebesgue measure on
$(\mathbb{R}, \mathcal{B})$.

(A.2) $0 < \int f^r\,dH < \infty$, $r = 1, 2$.

(A.3) $0 < E\varepsilon_1^2 < \infty$; $0 < E|X_0|^r|h(X_0)|^2 < \infty$, for $r = 0, 1, 2$.

(A.4) $\lim_{s \to 0}\int E\{X_0h(X_0)[f(y + sX_0) - f(y)]\}^2\,dH(y) = 0$,
$\lim_{s \to 0}\int Eh^2(X_0)|X_0|f(y + sX_0)\,dH(y) = Eh^2(X_0)|X_0|\int f\,dH$.

(A.5) $\int_0^\infty(1 - F(y))\,dH(y) < \infty$ and $H$ satisfies (2.2).


Next, define, for $t$ in $\mathbb{R}$,


(3) $Q_h(t) = \int[S_h(y, \rho) + 2E(X_0h(X_0))f(y)n^{1/2}(t - \rho)]^2\,dH(y)$.
We are now ready to state some results. 


THEOREM 1. Let $\{X_i,\ i = 0, \pm1, \ldots\}$ be as in (1.1). Assume that $F$ satisfies
(1.4) and that (A.1)-(A.5) hold. Then, for any $0 < b < \infty$,


(4) $E\sup_{|n^{1/2}(t - \rho)| \le b}|M_h(t) - Q_h(t)| = o(1)$.

PROOF. See Section 5.


REMARK 1. If $E(X_0h(X_0)) = 0$, from (3) and (4) it follows that $M_h$ cannot
be used to recover $\rho$ asymptotically. In particular, if $h(x) \equiv 1$, then because of
(1.4) and (A.3), $M_1$, the ordinary Cramér-von Mises statistic based on the
residuals $\{X_i - tX_{i-1}\}$ and $\{-X_i + tX_{i-1}\}$, cannot be used to estimate $\rho$ in the
above fashion.


THEOREM 2. In addition to the assumptions of Theorem 1, assume that
(A.6) either $xh(x) \ge 0$ for all $x$ or $xh(x) \le 0$ for all $x$,
(A.7) $E\{X_0h(X_0)\} \ne 0$.
Then

(5) $n^{1/2}(\hat\rho_h - \rho) = -\Bigl\{2E(X_0h(X_0))\int f^2\,dH\Bigr\}^{-1}\int S_h(y, \rho)f(y)\,dH(y) + o_p(1)$.


PROOF. See Section 5.


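COROLLARY 1. Under the assumptions of Theorem 2
(6) $n^{1/2}(\hat\rho_h - \rho) \Rightarrow N(0, \tau_h^2)$ r.v.,
where

$\tau_h^2 = \Bigl\{2E(X_0h(X_0))\int f^2\,dH\Bigr\}^{-2}Eh^2(X_0)E\psi^2(\varepsilon_1)$,
$\psi(x) = \psi_0(x) - \psi_0(-x)$, $\psi_0(x) = \int_{-\infty}^x f\,dH$, $x$ in $\mathbb{R}$.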


PROOF. Set $Y_{ni} = n^{-1/2}h(X_{i-1})\psi(\varepsilon_i)$, $1 \le i \le n$. Then


(7) $-\int S_h(y, \rho)f(y)\,dH(y) = \sum_{i=1}^n Y_{ni} = T_{nn}$,
where
$T_{nj} = \sum_{i=1}^j Y_{ni}$, $1 \le j \le n$.


For $|\rho| < 1$, $\{X_i\}$ is strictly stationary ergodic and $T_{nj}$ is $\mathcal{F}_j$ measurable, where
$\mathcal{F}_j = \sigma$-field $\{\varepsilon_i,\ i \le j\}$. Moreover, the symmetry of $F$ around 0 ensures $E\psi(\varepsilon_1) = 0$,
so that the independence of $\varepsilon_i$ and $X_{i-1}$ yields


(8) $EY_{ni} = 0$, for all $i$; $EY_{ni}Y_{nj} = 0$, for all $i \ne j$;
$EY_{ni}^2 = n^{-1}Eh^2(X_0)E\psi^2(\varepsilon_1)$, for all $i$.


Thus $\{T_{nj}, \mathcal{F}_j;\ 1 \le j \le n\}$ is a mean zero square integrable martingale array.
From (8), $ET_{nn} = 0$,


(9) $s_n^2 = \mathrm{Var}(T_{nn}) = Eh^2(X_0)E\psi^2(\varepsilon_1) = Eh^2(X_0)\mathrm{Var}(\psi(\varepsilon_1)) = \gamma^2$, say.
The ergodic theorem yields that


(10) $n^{-1}\sum_{i=1}^n h^2(X_{i-1}) \to Eh^2(X_0)$ a.s.

These observations may be used to verify that sufficient conditions of Corollary
3.1 of Hall and Heyde (1980) are satisfied so that $T_{nn} \Rightarrow N(0, \gamma^2)$. The claim
about $\hat\rho_h$ now follows from this and (5). □




Now, the Cauchy-Schwarz inequality implies that
(11) $\{EX_0h(X_0)\}^{-2}Eh^2(X_0) \ge \{EX_0^2\}^{-1}$,


with equality if, and only if, $h(x) \propto x$. Let $\tau^2$ denote $\tau_h^2$ when $h(x) \propto x$. Note
that $\tau^2$ denotes the asymptotic variance of $\hat\rho$. Thus (11) says that


(12) $\tau_h^2 \ge \tau^2$ with equality if, and only if, $h(x) \propto x$.
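
The inequality (11) concerns the only factor of the asymptotic variance that depends on $h$, namely $\{EX_0h(X_0)\}^{-2}Eh^2(X_0)$. The sketch below evaluates this factor for a few choices of $h$ under a small symmetric discrete law standing in for the distribution of $X_0$. The law, the function name `variance_factor`, and the candidate functions are all illustrative assumptions.

```python
def variance_factor(xs, probs, h):
    """{E X h(X)}^{-2} E h^2(X) for a discrete law with atoms xs and weights probs."""
    e_xh = sum(p * x * h(x) for x, p in zip(xs, probs))
    e_hh = sum(p * h(x) ** 2 for x, p in zip(xs, probs))
    return e_hh / e_xh ** 2


if __name__ == "__main__":
    xs = [-2.0, -1.0, 1.0, 2.0]
    probs = [0.1, 0.4, 0.4, 0.1]
    lower_bound = 1.0 / sum(p * x * x for x, p in zip(xs, probs))   # {E X^2}^{-1}
    candidates = {
        "h(x) = x": lambda x: x,
        "h(x) = 3x": lambda x: 3.0 * x,
        "h(x) = sign(x)": lambda x: 1.0 if x > 0 else -1.0,
        "h(x) = x^3": lambda x: x ** 3,
    }
    for name, h in candidates.items():
        print(name, variance_factor(xs, probs, h), lower_bound)
```

Each candidate satisfies $xh(x) \ge 0$, as (A.6) requires, and only the two functions proportional to $x$ attain the bound.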
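We also have

(13) $EX_0^2 = \sigma^2\{1 - \rho^2\}^{-1}$, $\sigma^2 = \mathrm{Var}(\varepsilon_1)$.
Moreover, the symmetry of $F$ and $H$ yields

(14) $E\psi^2(\varepsilon_1) = \mathrm{Var}\,\psi(\varepsilon_1) = 4K(F, H)$,
where


$K(F, H) = \int\!\!\int[F(x \wedge y) - F(x)F(y)]f(x)f(y)\,dH(x)\,dH(y)$.
Consequently,


(15) $\tau^2 = (4EX_0^2)^{-1}\mathrm{Var}(\psi(\varepsilon_1))\Bigl\{\int f^2\,dH\Bigr\}^{-2} = (1 - \rho^2)\sigma^{-2}K(F, H)\Bigl\{\int f^2\,dH\Bigr\}^{-2}$.

Summarizing the above results we have proved the following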


THEOREM 3. Among all estimators $\{\hat\rho_h\}$ of $\rho$ in (1.1), where $h$ satisfies (A.3),
(A.4), (A.6), and (A.7) for every $F$ and $H$ satisfying (A.1), (A.2), and (A.5), the
estimator that minimizes the asymptotic variance $\tau_h^2$ is $\hat\rho$, the $\hat\rho_h$ when $h(x) \equiv x$.
Moreover, $n^{1/2}(\hat\rho - \rho) \Rightarrow N(0, \tau^2)$ r.v., $\tau^2$ as in (15).


REMARK 2. Theorem 3 above proves the asymptotic optimality of $\hat\rho(H)$
among a class of estimators $\{\hat\rho_h(H),\ h \text{ as in Theorem 3}\}$ for every fixed $H$. As far
as finding an optimal $H$ in the class $\{\hat\rho(H),\ H \text{ varies}\}$ is concerned, observe that
$H$ appears in the asymptotic variance of $\hat\rho(H)$ only through $V(F, H) :=
K(F, H)\{\int f^2\,dH\}^{-2}$. The term $V(F, H)$ is precisely the asymptotic variance of
an M-estimator of the location parameter corresponding to the $\psi_0$ function.
From the minimax theory of such estimators (Huber, 1981), one readily concludes
that the optimal $H$ is given by the equation


$f\,dH = -I^{-1}\,d(f'/f)$, $0 < I = \int(f'/f)^2\,dF < \infty$,
where now it is assumed that $f'$ is the a.e. derivative of $f$. A consequence of this
observation is that if $H(y) = y$ or if $dH = \{F(1 - F)\}^{-1}\,dF$ then the corresponding
$\hat\rho$ are asymptotically efficient for logistic errors.




REMARK 3. Comparison with $\tilde\rho = \sum_{i=1}^n X_{i-1}X_i / \sum_{i=1}^n X_{i-1}^2$. Recall that under
(A.3), $n^{1/2}(\tilde\rho - \rho) \Rightarrow N(0, 1 - \rho^2)$ r.v. Thus the asymptotic efficiency of $\hat\rho$ vs. $\tilde\rho$ is


(16) $e = e(\hat\rho, \tilde\rho) = \{\text{asymptotic variance of } n^{1/2}(\tilde\rho - \rho)\} \times \{\text{asymptotic variance of } n^{1/2}(\hat\rho - \rho)\}^{-1}$
$= \sigma^2\{V(F, H)\}^{-1} = \sigma^2\Bigl\{\int f^2\,dH\Bigr\}^2\{K(F, H)\}^{-1}$.


Observe that the asymptotic efficiency $e$ is similar to that of signed rank
estimators corresponding to the score function $\psi(F^{-1}(u))$ of the location parameter
vs. the sample mean in the one sample location model. Thus for example if
$H(y) = y$, then $e = 12\sigma^2\{\int f^2\,dy\}^2$, the well-celebrated expression in connection
with the Wilcoxon rank estimator; see Lehmann (1975). In this sense $\hat\rho(\lambda)$ may
be said to be an extension of the Wilcoxon type rank estimator to the autoregression
model (1.1).
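
As a numerical illustration of the expression $e = 12\sigma^2\{\int f^2\,dy\}^2$ for $H(y) = y$, the sketch below integrates $f^2$ with the trapezoid rule for two error densities. The helper name `wilcoxon_type_efficiency`, the integration limits, and the step count are assumptions made for this example. The known closed-form values are $3/\pi$ for normal errors and $3/2$ for double exponential errors.

```python
import math


def wilcoxon_type_efficiency(density, variance, lo, hi, steps=100000):
    """e = 12 * sigma^2 * (integral of f^2 over [lo, hi])^2, via the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.5 * (density(lo) ** 2 + density(hi) ** 2)
    for k in range(1, steps):
        total += density(lo + k * width) ** 2
    integral = total * width
    return 12.0 * variance * integral ** 2


def normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def laplace_density(x):
    return 0.5 * math.exp(-abs(x))      # double exponential, variance 2


if __name__ == "__main__":
    print(wilcoxon_type_efficiency(normal_density, 1.0, -12.0, 12.0), 3.0 / math.pi)
    print(wilcoxon_type_efficiency(laplace_density, 2.0, -40.0, 40.0), 1.5)
```

The efficiency does not depend on the scale of the errors, since $\sigma^2$ and $\{\int f^2\}^2$ rescale in opposite directions.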


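REMARK 4. An estimator of $\tau^2$ when $H(y) = y$. In the case $H(y) = y$, the $\tau^2$
of (15) becomes


(17) $\tau^2 = (1 - \rho^2)\sigma^{-2}\Bigl\{12\Bigl(\int f^2(x)\,dx\Bigr)^2\Bigr\}^{-1}$.


Thus to estimate $\tau^2$ one needs to estimate $(1 - \rho^2)\sigma^{-2}$ and $\int f^2(x)\,dx$. An
estimator of $(1 - \rho^2)\sigma^{-2}$ is obviously $(n^{-1}\sum_{i=1}^n X_{i-1}^2)^{-1}$. Thus it remains to
construct an estimator of


(18) $b(f) = \int f^2(x)\,dx = \int f(x)\,dF(x)$.
Define
(19) $b(y, F) = \int[F(y + x) - F(-y + x)]\,dF(x)$, $y > 0$.


For any $y > 0$, $(2y)^{-1}n^{1/2}b(yn^{-1/2}, F) \to b(f)$ as $n \to \infty$. This suggests that we
first estimate $b(y, F)$. An estimator of $b(y, F)$ is


(20) $\hat b(y) = n^{-1}\int[V(y + x, \hat\rho_0) - V(-y + x, \hat\rho_0)]\,V(dx, \hat\rho_0)$,
where
(21) $V(y, t) = n^{-1/2}\sum_{i=1}^n I(X_i - tX_{i-1} \le y)$, $y$, $t$ in $\mathbb{R}$,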


and $\hat\rho_0$ is an estimator of $\rho$, not necessarily $\hat\rho$. Observe that $\hat b$ is the empirical d.f.
of $\{|X_i - X_j - \hat\rho_0(X_{i-1} - X_{j-1})|,\ 1 \le i, j \le n\}$. Let $s_{\alpha, n}$ be an $\alpha$th quantile of
this empirical d.f. Define


(22) pr = [a x x (12°28, n) Abn As, n)]) 


-2 




It is believed that if $f$ is uniformly continuous and bounded, $0 < E\varepsilon_1^2 < \infty$ and
$n^{1/2}(\hat\rho_0 - \rho) = O_p(1)$ then $\hat\tau^2$ is consistent for $\tau^2$. See Koul (1984) or Sievers
(1984) for the proof of a similar result in the linear regression model.
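
The two ingredients of this remark can be illustrated in a few lines. For standard normal errors, $b(y, F) = P(|\varepsilon_1 - \varepsilon_2| \le y) = \mathrm{erf}(y/2)$, so the limit $(2y)^{-1}n^{1/2}b(yn^{-1/2}, F) \to \int f^2 = 1/(2\sqrt{\pi})$ can be checked directly. The second function computes the empirical d.f. of absolute pairwise residual differences described in the remark. The function names are illustrative assumptions, and the residuals are taken as given rather than computed from a particular preliminary estimator.

```python
import math


def b_normal(y):
    """b(y, F) = P(|e1 - e2| <= y) for independent standard normal e1, e2."""
    return math.erf(y / 2.0)


def scaled_b(y, n):
    """(2y)^{-1} n^{1/2} b(y n^{-1/2}, F) for standard normal F."""
    return math.sqrt(n) * b_normal(y / math.sqrt(n)) / (2.0 * y)


def b_hat(residuals, y):
    """Empirical d.f. at y of |e_i - e_j| over all ordered pairs (i, j), diagonal included."""
    n = len(residuals)
    count = sum(1 for ei in residuals for ej in residuals if abs(ei - ej) <= y)
    return count / (n * n)


if __name__ == "__main__":
    target = 1.0 / (2.0 * math.sqrt(math.pi))
    for n in (10, 1000, 100000):
        print(n, scaled_b(1.0, n), target)
    print(b_hat([0.0, 1.0, 3.0], 1.0))
```

The pairwise count costs $O(n^2)$. Sorting the residuals first would reduce this, but the direct form stays closest to the definition.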


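REMARK 5. Effects of asymmetry on $\hat\rho$. Suppose that $F$ is asymmetric but
has mean zero so that $EX_0 = 0$. Consequently, $ES(y, \rho) = 0$ and one can still
define $\hat\rho$ by (1.5) and (1.7). Such a $\hat\rho$ will not have any asymptotic bias. In fact,
under suitably modified assumptions that compensate for the lack of symmetry
one can prove that


$n^{1/2}(\hat\rho - \rho) = -\Bigl\{EX_0^2\int g^2\,dH\Bigr\}^{-1}\int S(y, \rho)g(y)\,dH(y) + o_p(1)$,


where
$g(y) = f(y) + f(-y)$, $y$ in $\mathbb{R}$.
Similarly to the arguments of Corollary 1, one can then deduce that


$n^{1/2}(\hat\rho - \rho) \Rightarrow N(0, \nu^2)$ r.v.,


where now
$\nu^2 = \Bigl\{EX_0^2\int g^2\,dH\Bigr\}^{-2}\lim_{n \to \infty}\mathrm{Var}\Bigl(n^{-1/2}\sum_{i=1}^n X_{i-1}G(\varepsilon_i)\Bigr)$,
and
$G(x) = G_0(x) - G_0(-x)$,
$G_0(x) = \int_{-\infty}^x g(y)\,dH(y)$, $x$ in $\mathbb{R}$.
Direct calculations show that


(23) $\nu^2 = \Bigl\{EX_0^2\int g^2\,dH\Bigr\}^{-2}EX_0^2\bigl\{\mathrm{Var}\,G(\varepsilon_1) + (1 + \rho)(1 - \rho)^{-1}[EG(\varepsilon_1)]^2\bigr\}$

$= \Bigl\{\int g^2\,dH\Bigr\}^{-2}\bigl\{(EX_0^2)^{-1}\mathrm{Var}\,G(\varepsilon_1) + (1 + \rho)^2\sigma^{-2}[EG(\varepsilon_1)]^2\bigr\}$.


Now suppose that $H$ is symmetric around 0. Then $G_0(x) = \psi_0(x) - \psi_0(-x) +
\psi_0(\infty)$, $G(x) = \psi(x) - \psi(-x) = 2\psi(x)$, for all $x$; $\psi$ as in (6).
Consequently, from (13) and (15),


$\nu^2 = \Bigl\{\int g^2\,dH\Bigr\}^{-2}\Bigl\{16\tau^2\Bigl(\int f^2\,dH\Bigr)^2 + 4(1 + \rho)^2\sigma^{-2}[E\psi(\varepsilon_1)]^2\Bigr\}$.
Therefore, from (15), and the inequality $\int g^2\,dH \le 4\int f^2\,dH$, one gets


$\dfrac{\nu^2}{\tau^2} \ge 1 + \dfrac{1 + \rho}{1 - \rho}\,a$, $\quad a = \dfrac{[E\psi(\varepsilon_1)]^2}{4K(F, H)}$.
Thus, as long as $E\psi(\varepsilon_1) \ne 0$, so that $a > 0$, one has

(24) $\nu^2 > \tau^2$, for all $\rho$, and $\sup_{|\rho| < 1}\dfrac{\nu^2}{\tau^2} = \infty$!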







Asymmetry of $F$ and absolute continuity of $H$ is enough to ensure $E\psi(\varepsilon_1) \ne 0$.
Thus even though $\hat\rho$ has no asymptotic bias its asymptotic variance can be
heavily influenced by the asymmetry of $F$ even when $E\varepsilon_1 = 0$ and $H$ is symmetric
around 0.


REMARK 6. During the preparation of this paper the author received a
preprint of the paper by Wang (1986) proposing a m.d. estimator $\hat\rho_w$ of $\rho$ in (1.1)
as a minimizer $t$ of


$W(t) = \int\Bigl[n^{-1/2}\sum_{i=1}^n X_{i-1}I(X_i \le y + tX_{i-1})\Bigr]^2\,dH(y)$.


Wang does not require symmetry of $F$ but assumes $E\varepsilon = 0$ and that $H$ is
bounded. The asymptotic variance $\sigma_w^2$ of $n^{1/2}(\hat\rho_w - \rho)$ turns out to be


-2 
o2= 1+ la + py Eole, )zlo fiar) | 7? asin (15). 
Thus, for symmetric $F$ and $H$,


$\sigma_w^2/\tau^2 = 1 + \bigl[(1 + \rho)(1 - \rho)^{-1}\{[E\psi_0(\varepsilon_1)]^2/\mathrm{Var}\,\psi_0(\varepsilon_1)\}\bigr]$,
which is arbitrarily large for $\rho$ close to 1.

REMARK 7. Connection with GM-estimators. If in (2.5) of Denby and Martin
(op. cit.) one takes $g \equiv h$, $\psi(x) = \int_{-x}^x f\,dH$, then one has $|n^{1/2}(\hat\rho_h - \hat\rho_{GM})| = o_p(1)$,
where $\hat\rho_{GM}$ is the GM-estimator. Now it is known that if $\psi(x)$ of $\hat\rho_{GM}$ is $x$ then
$\hat\rho_{GM} = \tilde\rho$. But choosing $\psi(x) = x$ would violate our assumption (A.2). Thus
$\{\hat\rho_h(H)\}$ has no connection with the least-squares estimator $\tilde\rho$ under the conditions
of this paper.


REMARK 8. Influence curve of $\hat\rho_h$. Let
$m(t, F, y) = n^{-1/2}E\{S_h(y, t)\}$
$= Eh(X_0)[F(y + (t - \rho)X_0) + F(-y + (t - \rho)X_0) - 1]$,


$\mu(t, F) = \int m^2(t, F, y)\,dH(y)$.


Define $T(F)$ by the relation
$\inf_t \mu(t, F) = \mu(T(F), F)$.


If $F$ is symmetric around 0 then $T(F) = \rho$. Let $L$ be a d.f. and define
$F_s = F + s(L - F)$, $0 \le s \le 1$, $T_s$ as a minimizer of $\mu(t, F_s)$ w.r.t. $t$. If $F$ is
symmetric then $T_0 = T(F) = \rho$. If $L(y) = \delta_z(y) = I(y \ge z)$ and if $\dot T_0 =
(\partial/\partial s)T_s|_{s=0}$ exists, then $\dot T_0$ is called the influence curve of $T(F)$ at $F$.
Proceeding as in Huber (1981), one can derive, under some regularity conditions,
that


$IC(z, T, F) = \dfrac{Eh(X_0)[\psi_0(z) - \psi_0(-z)]}{2EX_0h(X_0)\int f^2\,dH}$, $z$ in $\mathbb{R}$.


Under (A.2), $\psi_0$ is bounded. Thus the influence of $z$ is bounded on $T(F)$ and
hence on $\hat\rho_h$ at symmetric $F$. In particular if $h(x) \equiv x$ and $E\varepsilon_1 = 0$, so that
$EX_0 = 0$, then $\hat\rho$ is not influenced by the contamination of errors.


REMARK 9. Asymptotics of scale invariant version of $\hat\rho(H)$. Let $s = s(X) =
s(X_0, \ldots, X_n)$ be an estimator such that $s$ is positive and


(25) $s(bX) = |b|s(X)$, for all $b$ in $\mathbb{R}$.


In addition, assume that there is a positive constant $\gamma$, possibly depending on $F$,
such that


(26) $n^{1/2}(s - \gamma) = O_p(1)$.
Let $\hat\rho^*(H)$ denote the scale invariant version of $\hat\rho(H)$ proposed near (1.7). Then,
under some additional conditions on $F$ and $H$, one can prove that

$n^{1/2}(\hat\rho^*(H) - \rho) \Rightarrow N(0, \tau_*^2)$ r.v.,
where $\tau_*^2$ is obtained from the $\tau^2$ of (15) after $H(\cdot)$ there is replaced by $H(\cdot/\gamma)$.
See Corollary 5.1 below for the proof of (26) for the estimator $s$ of (2.7).


REMARK 10. Asymptotic distributions of extensions of $\hat\rho$. Recall the definition
of $(\hat\theta, \hat\rho)$ from (2.8)-(2.10). Under suitable assumptions one can show that the
asymptotic distribution of $n^{1/2}(\hat\theta - \theta, \hat\rho - \rho)$ is bivariate normal with the
asymptotic mean 0 and the asymptotic covariance matrix


(27) $\Sigma_1 = \begin{pmatrix} 1 & n^{-1}\sum_{i=1}^n EX_{i-1} \\ n^{-1}\sum_{i=1}^n EX_{i-1} & n^{-1}\sum_{i=1}^n EX_{i-1}^2 \end{pmatrix}^{-1} \times V_{\mathrm{loc}}(\psi_0)$.


Next, recall the definition of $(\hat\rho_1, \hat\rho_2)$ from (2.12)-(2.14) in the stationary AR(2)
model (2.11). Again, one can deduce under suitably modified assumptions that,
for a symmetric $F$, $n^{1/2}(\hat\rho_1 - \rho_1, \hat\rho_2 - \rho_2) \Rightarrow N_2(0, \Sigma_2)$ r.v.'s, where

(28) $\Sigma_2 = \begin{pmatrix} EX_0^2 & EX_0X_1 \\ EX_0X_1 & EX_0^2 \end{pmatrix}^{-1} \times V_{\mathrm{loc}}(\psi_0)$.

In the above, $V_{\mathrm{loc}}(\psi_0) = \mathrm{Var}(\psi_0(\varepsilon_1))/(\int\psi_0'\,dF)^2 = K(F, H)/(\int f^2\,dH)^2$.

4. Tests of goodness of fit. Consider the model (1.1) with $F$ as d.f. of $\{\varepsilon_i\}$.

Consider the problem of testing


$H_0$: $F(y) = 1 - F(-y)$, for all $y$.










A natural class of tests of $H_0$ is given by $M_h(\hat\rho_h)$. The asymptotic null distribution
of these statistics is derived from the following


THEOREM 1. Under the assumptions of Theorem 3.2


(1) $M_h(\hat\rho_h) = \int\Bigl\{S_h(y, \rho) - f(y)\dfrac{\int S_h(y, \rho)\,d\psi_0(y)}{\int f\,d\psi_0}\Bigr\}^2\,dH(y) + o_p(1)$, $(H_0)$.


PROOF. Follows from (3.4), (3.5), and (3.6). □


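Now write $S(y)$ for $S_h(y, \rho)$. We shall throughout assume that $H$ is symmetric
around 0. For any process $Y$, define


(2) $\|Y\|_H^2 = 2\int_0^\infty Y^2\,dH$.
Then by the symmetry of $H$ and $F$, the leading term on the right-hand side of (1)
is $\|Z_n\|_H^2$, where, for all $y$ real,
(3) $Z_n(y) = S(y) - a(y)\int_0^\infty S\,d\psi_0$, $\quad a(y) = f(y)\Bigl\{\int_0^\infty f\,d\psi_0\Bigr\}^{-1}$.
Observe that
(4) $S(y) = n^{-1/2}\sum_{i=1}^n h(X_{i-1})[I(\varepsilon_i \le y) + I(\varepsilon_i \le -y) - 1]$, for all real $y$.

Let
(5) $\alpha(x, y) = I(x \le y) + I(x \le -y) - 1$, $x$, $y$ real.
Then, under $H_0$, $E\alpha(\varepsilon_i, y) = 0$, all $i$, all $y$. Moreover, for $x, y \ge 0$, $ES(x) = 0$,

(6) $ES(x)S(y) = Eh^2(X_0)E\alpha(\varepsilon_1, x)\alpha(\varepsilon_1, y)$
$= 2b\min(1 - F(x), 1 - F(y))$ $\quad (b = Eh^2(X_0))$,
$= 2bC(x, y)$, say.
Then, by Fubini's theorem
(7) $K_n(x, y) = \mathrm{Cov}(Z_n(x), Z_n(y)) = E(Z_n(x)Z_n(y))$
$= ES(x)S(y) - a(y)\int_0^\infty ES(x)S(t)\,d\psi_0(t) - a(x)\int_0^\infty ES(y)S(t)\,d\psi_0(t)$
$\quad + a(x)a(y)\int_0^\infty\!\!\int_0^\infty ES(s)S(t)\,d\psi_0(s)\,d\psi_0(t)$
$= 2b\Bigl\{C(x, y) - a(y)\int_0^\infty C(x, t)\,d\psi_0(t) - a(x)\int_0^\infty C(y, t)\,d\psi_0(t)$
$\quad + a(x)a(y)\int_0^\infty\!\!\int_0^\infty C(s, t)\,d\psi_0(s)\,d\psi_0(t)\Bigr\}$
$=: K(x, y)$, say.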

MINIMUM DISTANCE AUTOREGRESSION 1207 


Now, let $W$ be a Wiener process on $[0,1]$, $W(0) = 0$, $EW = 0$, $\mathrm{Cov}(W(s), W(t)) = \min(s, t)$. Define

$$Z(x) = b^{1/2}\Big\{W(2(1 - F(x))) - a(x)\int_0^\infty W(2(1 - F(t)))\,d\psi_0(t)\Big\}, \quad x \ge 0.$$

Note that $\mathrm{Cov}(Z(x), Z(y)) = K(x, y)$, $x, y \ge 0$.
Now use Theorem VI.2.1, Parthasarathy (1967), as in Millar (1981), to conclude that

$$\|Z_n\|_H^2 = 2\int_0^\infty Z_n^2(y)\,dH(y) \Rightarrow 2\int_0^\infty Z^2(y)\,dH(y) = \|Z\|_H^2, \quad \text{under } H_0.$$

Consequently, by (1),

(8) $$M_h(\hat\rho_h) \Rightarrow \|Z\|_H^2, \quad \text{under } H_0.$$

But

(9) $$\|Z\|_H^2 = 2b\int_0^\infty \Big\{W(2(1 - F)) - a\int_0^\infty W(2(1 - F))\,d\psi_0\Big\}^2 dH =: bB(W, F), \text{ say}.$$
Consequently, if we estimate $b$ by

(10) $$\hat b = n^{-1}\sum_{i=1}^n h^2(X_{i-1})$$

and consider

(11) $$\hat M_h(\hat\rho_h) = M_h(\hat\rho_h)/\hat b,$$

then by (3.10) and the above discussion we have

(12) $$\hat M_h(\hat\rho_h) \Rightarrow B(W, F).$$

Thus, the first consequence of (12) is that the asymptotic level of the $\hat M_h$-test is the same for every $h$. The second observation is that the limiting r.v. $B(W, F)$ is the same r.v. that arises when testing for symmetry in the one sample location model. Its distribution is accessible as in Martynov (1975, 1976) and Boos (1982).

Next, consider the problem of testing

(13) $$H: F = F_0, \quad F_0 \text{ a known d.f.}$$

A test of $H$ is to reject $H$ when $\hat T$ is large, where

(14) $$\hat T = \int [V(y, \hat\rho) - n^{1/2}F_0(y)]^2\,dF_0(y).$$


From the proof of (i) in Section 5 [see (5.8)-(5.18)] and Remark 5.1 one deduces the following

PROPOSITION. Suppose that (1.1) and $H$ hold. Assume that the error d.f. $F_0$ has bounded and continuous density $f_0$, $EX_0^2 < \infty$, $EX_0 = 0$, and that $\hat\rho$ is such that $|n^{1/2}(\hat\rho - \rho)| = O_p(1)$. Then $\hat T \Rightarrow \int_0^1 B^2(t)\,dt$, where $B$ is the continuous Brownian bridge on $[0,1]$.


Consequently, the test of H based on T is asymptotically distribution free. A 
similar conclusion may be drawn about the test of H based on the 
Anderson—Darling statistic. 
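A statistic of the Cramér-von Mises type such as the one in (14) can be computed from the ordered residuals by the usual computing formula $n\int (F_n - F_0)^2\,dF_0 = 1/(12n) + \sum_{i=1}^n \{F_0(e_{(i)}) - (2i-1)/(2n)\}^2$. The sketch below assumes that $V(y, t)$ in (14) is the $n^{-1/2}$-normalized count of residuals $X_i - tX_{i-1}$ not exceeding $y$, takes the standard normal d.f. as a hypothetical $F_0$, and uses illustrative function names that do not come from the paper.

```python
import math


def normal_cdf(y):
    """Standard normal d.f., used here as a hypothetical F_0."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def ar1_residuals(x, rho_hat):
    """Residuals X_i - rho_hat * X_{i-1}, i = 1, ..., n, from x = (X_0, ..., X_n)."""
    return [x[i] - rho_hat * x[i - 1] for i in range(1, len(x))]


def cramer_von_mises(residuals, cdf):
    """n * integral of (F_n - F_0)^2 dF_0, via the standard computing formula."""
    n = len(residuals)
    u = sorted(cdf(r) for r in residuals)
    # With 0-based index i, the i-th order statistic is compared with (2i + 1) / (2n).
    return 1.0 / (12.0 * n) + sum(
        (u[i] - (2 * i + 1) / (2.0 * n)) ** 2 for i in range(n)
    )


# Hypothetical short series and a hypothetical estimate of rho.
series = [0.3, -0.1, 0.8, 0.2, -0.6, 0.1]
stat = cramer_von_mises(ar1_residuals(series, 0.25), normal_cdf)
print(stat)
```

Large values of the statistic speak against $H$; critical values would come from the distribution of $\int_0^1 B^2(t)\,dt$, which this sketch does not compute.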

Pierce (1985) indicates without proof that the asymptotic null distribution of 
any test of the normality of errors in (1.1) is the same as that of its counterpart in 
the one sample location scale model. This observation is consistent with the 
results obtained here. 


5. Proofs of Theorems 3.1 and 3.2. We shall first give a proof of Theorem 3.1. The basic proof of this theorem is similar to that of Theorem 5.1 of K-D (op. cit.), but because of the dependence of $\{X_i\}$, some calculations are intricate. We shall thus be brief, indicating only the differences.

For any functions $g$ and $k$ from $\mathbb{R} \times \mathbb{R}$ to $\mathbb{R}$, $|g_t|_H^2 := \int g^2(y, t)\,dH(y)$, $|g_t - k_s|_H^2 = \int [g(y, t) - k(y, s)]^2\,dH(y)$, for $s, t$ in $\mathbb{R}$.


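PROOF OF THEOREM 3.1. With $\{\rho, X_i, \varepsilon_i, F\}$ as in (1.1) and $h$ as in Theorem 3.1, define

(1) $$J(y, t) = \int h(x)F(y + n^{-1/2}tx)\,dG(x) \quad (G \text{ d.f. of } X_0),$$

(2) $$W(y, t) = n^{-1/2}\sum_{i=1}^n \{h(X_{i-1})I(\varepsilon_i \le y + n^{-1/2}tX_{i-1}) - J(y, t)\}, \quad y, t \text{ in } \mathbb{R}.$$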


Since $h$ is held fixed, we shall now write $S$, $Q$, $M$ for the $S_h$, $Q_h$, $M_h$ of (3.1)-(3.3). Observe that

(3) the left-hand side of (3.4) $= E\sup_{|t| \le b}|M(n^{-1/2}t + \rho) - Q(n^{-1/2}t + \rho)|$.

From (3.1) and (1),

(4) $$M(n^{-1/2}t + \rho) = \int \Big[W(y, t) + W(-y, t) + n^{1/2}\{J(y, t) + J(-y, t)\} - n^{-1/2}\sum_i h(X_{i-1})\Big]^2 dH(y)$$
$$= \int \Big[\{W(y, t) - W(y, 0)\} + \{W(-y, t) - W(-y, 0)\} + \{S(y, \rho) + 2taf(y)\}$$
$$\qquad + \{n^{1/2}[J(y, t) - J(y, 0)] - taf(y)\} + \{n^{1/2}[J(-y, t) - J(-y, 0)] - taf(-y)\}\Big]^2 dH(y),$$

where $a = \int xh(x)\,dG(x)$ and where we have used the fact that for all $y$,

$$S(y, \rho) = W(y, 0) + W(-y, 0) - n^{-1/2}\sum_i h(X_{i-1}) + n^{1/2}[J(y, 0) + J(-y, 0)],$$

which in turn follows from the definitions. Also note that

(5) $$Q(n^{-1/2}t + \rho) = \int [S(y, \rho) + 2atf(y)]^2\,dH(y).$$


Using the quadratic expansion and the Cauchy-Schwarz inequality on the cross product terms one gets an inequality involving $L_2(H)$ norms of the differences $W(\cdot, t) - W(\cdot, 0)$ and $n^{1/2}[J(\cdot, t) - J(\cdot, 0)] - taf$, just like the inequality (5.6) in K-D. Thus to prove (3.4) it suffices to prove

(i) $$E\sup_t |W_t - W_0|_H^2 = o(1),$$

(ii) $$\sup_t |n^{1/2}[J_t - J_0] - taf|_H^2 = o(1) \quad \Big(a = \int xh(x)\,dG(x)\Big),$$

and

(iii) $$E\sup_t |S(\cdot, \rho) + 2atf|_H^2 = O(1),$$

where the sup over $t$ is over $|t| \le b$. But (iii) readily follows from (A.2) and (A.5).


PROOF OF (ii). Define $h^+(x) = h(x)I(xh(x) > 0)$, $h^- = h - h^+$. Let $J^\pm$, $a^\pm$ stand for the $J$ and $a$ of (1) and (4) when $h$ is replaced by $h^\pm$ so that $J = J^+ + J^-$, $a = a^+ + a^-$. By the inequality $(b + c)^2 \le 2b^2 + 2c^2$ for all reals $b, c$, (ii) will follow if we prove it for $J^\pm$. To begin with note that the Cauchy-Schwarz inequality and Fubini's theorem together with the fact that $(h^\pm)^2 \le h^2$, imply that for fixed $t$,


|n'2[ J+ — Jt] -atti ia 

a J frre) (eX? LF(y + tnx) — F(y)] - taf(y)}" dG (x) dH(y) 
satan 71) fT JELA) f(y + 8X0) = H(a)]}? dC) ds 
+0, by (A.2)-(A5). 

Now observe that $J^+$ ($J^-$) is a nondecreasing (nonincreasing) function of $t$. This fact together with the compactness of $[-b, b]$ and (6) yields (ii) in a standard fashion. This completes the proof of (ii). We now turn to the


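PROOF OF (i). Write $W^\pm$ for $W$ when $h$ in $W$ is replaced by $h^\pm$. Define

(7) $$p(y, t; x) = F(y + tn^{-1/2}x) - F(y), \quad y, x, t \text{ in } \mathbb{R}.$$

Observe that for all $y, t$,

(8) $$W^+(y, t) - W^+(y, 0) = n^{-1/2}\sum_i h^+(X_{i-1})\{I(\varepsilon_i \le y + tn^{-1/2}X_{i-1}) - I(\varepsilon_i \le y) - p(y, t; X_{i-1})\}$$
$$\qquad + n^{-1/2}\sum_i \{h^+(X_{i-1})p(y, t; X_{i-1}) - J^+(y, t) + J^+(y, 0)\}$$
$$= R_1^+(y, t) + R_2^+(y, t), \text{ say},$$

where

(9) $$R_1^+(y, t) = n^{-1/2}\sum_i h^+(X_{i-1}) \times \{I(\varepsilon_i \le y + tn^{-1/2}X_{i-1}) - I(\varepsilon_i \le y) - p(y, t; X_{i-1})\}.$$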


Note that $R_1^+$ is the sum of conditionally centered r.v.'s so that the covariance of any two summands is zero. Consequently, by Fubini's theorem for every fixed $t$,


E\Rily < [E(h*(Xq)}'|F(y + 2X) — F(y)|dH(y) 


“0 sf" fh Xo) Xol f(y + 8X0) dH(y) ds 


> 0, by (A.4). 
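Next, consider $R_2^+$. Rewrite

(11) $$R_2^+(y, t) = n^{-1/2}\sum_i \{h^+(X_{i-1})[F(y + tn^{-1/2}X_{i-1}) - F(y)] - Eh^+(X_{i-1})[F(y + tn^{-1/2}X_{i-1}) - F(y)]\}$$
$$= n^{-1/2}\sum_i h^+(X_{i-1})[F(y + tn^{-1/2}X_{i-1}) - F(y) - tn^{-1/2}X_{i-1}f(y)]$$
$$\quad - n^{-1/2}\sum_i Eh^+(X_{i-1})[F(y + tn^{-1/2}X_{i-1}) - F(y) - tn^{-1/2}X_{i-1}f(y)]$$
$$\quad + tn^{-1}\sum_i [h^+(X_{i-1})X_{i-1} - Eh^+(X_{i-1})X_{i-1}]f(y)$$
$$= A_1(y, t) - A_2(y, t) + A_3(y, t), \text{ say}.$$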
Fubini’s theorem and (A.1)-(A.3) imply that 


E sup |A, < 46°(2b/vn) 
jts è 


(02) x [OF JB [h*(Ko)Kol + Xo) = H(3)}]* aH) de 


+0, by (A.4). 


Note that $A_2(y, t) = EA_1(y, t)$. Therefore, (12) and Fubini's theorem imply that

(13) $$E\sup_{|t| \le b}|A_{2t}|_H^2 \le E\sup_{|t| \le b}|A_{1t}|_H^2 \to 0.$$

Next, because {X,} is stationary and ergodic and because of (A.2) and (A.3), 
Varn -25h *(X,1)X,-1} = 00), 

see, e.g., Hall and Heyde (1980). Therefore 

(14) Esup|A;,\2, < bn" fVar( n PATRO X} -> 


itisb 

From (11)-(14) it readily follows that

(15) $$E\sup_{|t| \le b}|R_{2t}^+|_H^2 \to 0.$$

From (8), (10), and (15) it follows that for each fixed $t$ in $[-b, b]$,

(16) $$E|W_t^+ - W_0^+|_H^2 \to 0.$$

To complete the proof of (i) observe that

(17) $$|W_t - W_0|_H^2 \le 2\{|W_t^+ - W_0^+|_H^2 + |W_t^- - W_0^-|_H^2\}.$$

Now, let $-b = t_0 < t_1 < \cdots < t_r = b$ be a partition of $[-b, b]$ such that $\max_{1 \le j \le r}(t_j - t_{j-1}) \to 0$ as $r \to \infty$. Observe that $W_t^+$ ($W_t^-$) is a difference of two nondecreasing (nonincreasing) functions of $t$. Therefore,


sup |W,* — W,*|?, < af me max [WF - W+ + max 
lils b O<ysr 
so that (16) and (6) imply 

lim sup E sup |W,* — Wo*lz, 


n= \t]sb 





nal ag -ailih 


(18) < max (t,— t,-1) If l(E|Xoh*(X)|) x 16 
Osysr 
>0 asr> oo. 


This together with (17) entails (i) and hence Theorem 3.1. □

REMARK 1. It should be emphasized that the symmetry of $F$ is not required for the proof of (i) and (ii). The symmetry of $F$ is used only in proving (iii). In fact one can also prove an analogue of Theorem 3.1 when $F$ is not symmetric, but the calculations get involved. There will be an extra term in $Q$. In fact, the approximating $Q$ will be

$$Q_n(n^{-1/2}t + \rho) = \int [S(y, \rho) - ES(y, \rho) + tag(y) + ES(y, \rho)]^2\,dH(y), \qquad g(y) = f(y) + f(-y).$$




We shall now sketch a proof of the asymptotic normality of s of (2.7). 
Accordingly, let 


Fily) =n E(X,- êX, sy), Fy) =Fly)- F(-y), 


and 
T.(y) = MPR y) -F*(y)), > 0. 
Assume that F* has the unique median y. Observe that for any real u, 
T (y+ un) = W(y,, È) — W- yn, Ê 
(19) WY ) (Yn ) A( Yr ) 


ELACA CA î) a Jh- Yn; Ê) ai F*(y,)], 
where f = n'/*(p, — P), Yn = Y + un 1/7, and W,, J, are the W and J of (1) and 
(2) above with h= 1. Using arguments like those used for the proof of (i) 
[see (8)-(18) above] one can show that if f is bounded and continuous and 
0 < EX? < œ, then for any real u, 0 < b < œ, 


(20) E sup |W,(y + un, t) - W,(y,0)| > 0, 
lts 


and 


sup |n'/?[ J,(y + un”, t) — F(y + un™””)] — tEX,f(Y)|> 0. 
\s6 


From (19) and (20), if |ê] = O,(1) then 
(21) Ty + un~?) = Sy) + HEX f(y) — f(-)} + 0,(2), 
where 

SY) = nE (lel s y) — F*(y)}- 


Moreover, continuity of f yields that, for any real u, 

(22) n| F*(y) — F*(y + un™™™®)] > -ul f(y) + FY). 

Observe that S,(y) is the standardized sum of i.i.d. Bernoulli (}) r.v.’s. Conse- 
quently, S,(y) = N(0,}). Now use the standard argument of converting the 


events based on the sample median to the events based on the corresponding 
empiricals and the above discussion to conclude the following 


COROLLARY. Suppose that (1.1) holds; the error d.f. F has bounded and 
continuous density f; F* has unique positive median y; n'/*|p, — p| = O,(1) and 
that 0 < EX? < œ. Then 


|n'(s - y)|= 0,(1). 


In addition, if either (i) EX,=0 or (ii) F is symmetric around 0, then 
n'/2(5 — y) = N(0, x?) r.v., where 


v»? = { f(y) +f(-y)}~”, incase (i) 
= {2f(y)}~*, in case (ii). 




REFERENCES 


BERAN, R. J. (1977). Minimum Hellinger distance estimates for parametric models. Ann. Statist. 5 445-463.

BERAN, R. J. (1978). An efficient and robust adaptive estimator of location. Ann. Statist. 6 292-313.

BOOS, D. (1981). Minimum distance estimators for location and goodness of fit. J. Amer. Statist. Assoc. 76 663-670.

BOOS, D. (1982). A test for asymmetry associated with the Hodges-Lehmann estimator. J. Amer. Statist. Assoc. 77 647-651.

CHENG, K. F. and SERFLING, R. J. (1981). On estimation of a class of efficiency-related parameters. Scand. Actuar. J. 83-92.

DENBY, L. and MARTIN, R. D. (1979). Robust estimation of the first-order autoregressive parameter. J. Amer. Statist. Assoc. 74 140-146.

FOX, A. J. (1972). Outliers in time series. J. Roy. Statist. Soc. Ser. B 34 350-363.

HALL, P. and HEYDE, C. C. (1980). Martingale Limit Theory and Its Applications. Academic, New York.

HUBER, P. J. (1981). Robust Statistics. Wiley, New York.

KOUL, H. L. (1980). Some weighted empirical inferential procedures for a simple regression model. Colloq. Math. Soc. János Bolyai, Nonpar. Statist. Inf. 32 637-565.

KOUL, H. L. (1984). Weighted Empiricals and Linear Regression Models. A monograph. MSU RM Series 439.

KOUL, H. L. (1985). Minimum distance estimation in multiple linear regression. Sankhyā Ser. A 47 67-74.

KOUL, H. L. and DEWET, T. (1983). Minimum distance estimation in a linear regression model. Ann. Statist. 11 921-932.

LEHMANN, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco.

MARTIN, R. D. (1981). Robust methods for time series. In Applied Time Series Analysis II (D. F. Findley, ed.) 683-759. Academic, New York.

MARTYNOV, G. V. (1975). Computation of distribution functions of quadratic forms of normally distributed r.v.'s. Theory Probab. Appl. 20 782-793.

MARTYNOV, G. V. (1976). Computation of limit distributions of statistics for normality tests of type $\omega^2$. Theory Probab. Appl. 21 1-13.

MILLAR, P. W. (1981). Robust estimation via minimum distance methods. Z. Wahrsch. verw. Gebiete 55 73-89.

MILLAR, P. W. (1982). Optimal estimation of a general regression function. Ann. Statist. 10 717-740.

PARR, W. C. (1981). Minimum distance estimation: A bibliography. Comm. Statist. A-Theory Methods 10 1205-1224.

PARR, W. C. and DEWET, T. (1981). On minimum Cramér-von Mises norm parameter estimation. Comm. Statist. A-Theory Methods 10 1149-1166.

PARR, W. C. and SCHUCANY, W. R. (1980). Minimum distance and robust estimation. J. Amer. Statist. Assoc. 75 616-624.

PARTHASARATHY, K. R. (1967). Probability Measures on Metric Spaces. Academic, New York.

PIERCE, D. A. (1985). Testing normality in autoregressive models. Biometrika 72 293-297.

SIEVERS, G. L. (1982). A consistent estimate of a nonparametric scale parameter. Inst. Statist. Mimeo series 1601, Univ. of North Carolina, Chapel Hill.

WANG, C. W. H. (1986). A minimum distance estimator for first-order autoregressive processes. Ann. Statist. 14 1180-1193.

WILLIAMSON, M. (1979). Weighted empirical type estimation of the linear regression parameter. Ph.D. thesis, Michigan State Univ.

WOLFOWITZ, J. (1957). Minimum distance method. Ann. Math. Statist. 28 75-88.


DEPARTMENT OF STATISTICS 
MICHIGAN STATE UNIVERSITY 
East LANSING, MICHIGAN 48824 


The Annals of Statistics 
1986, Vol. 14, No. 3, 1214-1225


SECOND-ORDER RISK STRUCTURE OF GLSE AND MLE 
IN A REGRESSION WITH A LINEAR PROCESS¹


By YASUYUKI TOYOOKA
Osaka University 


In a regression model with an error that is a general linear process, the second-order expansion of the risk matrix of GLSE or MLE is obtained. A set of sufficient conditions for the effect of estimating the structural parameter of the linear process to vanish in the above expansion is obtained. The relation of the covariance matrix of SLSE with those of GLSE and MLE up to $O(T^{-2})$ is elucidated.


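1. Introduction. In this paper we consider the estimation of $\beta$ when $\theta$ is unknown in the regression model

(1.1) $$y_t = x_t'\beta + u_t,$$

where $\{x_t\}$ is a sequence of $p$-dimensional fixed design vectors, $\beta \in R^p$, and $u_t$ is a general linear process,

(1.2) $$u_t = \sum_{j=0}^\infty g_j(\theta)\varepsilon_{t-j}$$

with $g_0(\theta) = 1$, $\sum_{j=0}^\infty g_j(\theta)^2 < \infty$, and $\{\varepsilon_t\}$ is a sequence of i.i.d. $N(0, \sigma^2)$ random variables, where

$$\sigma^2 = 2\pi\exp\Big\{\frac{1}{2\pi}\int_{-\pi}^{\pi}\log f_\theta(\lambda)\,d\lambda\Big\}$$

with

(1.3) $$f_\theta(\lambda) = \frac{\sigma^2}{2\pi}\Big|\sum_{j=0}^\infty g_j(\theta)e^{ij\lambda}\Big|^2 = \frac{\sigma^2}{2\pi}g(\lambda, \theta) \quad (\text{say}).$$

The parameter space of $\theta$ is an open subset $\Theta$ of $R^1$. As is well known, the finite parameter stationary models, such as the autoregressive model, moving average model, and autoregressive-moving average model, can be expressed as the general linear process $\{u_t\}$ in (1.2).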

When $\{y_t;\ t = 1, \ldots, T\}$ and $\{x_t;\ t = 1, \ldots, T\}$ are observed, the statistical linear model is

(1.4) $$y = X\beta + u, \quad E(u) = 0 \text{ and } \mathrm{Cov}(u) = V(\theta),$$





Received March 1984; revised August 1985.
¹Research partially supported by U.S. Bureau of the Census through Joint Statistical Agreement J.S.A. 84-2.
AMS 1980 subject classifications. Primary 62F10; secondary 62J10.
Key words and phrases. GLSE, Grenander's condition, linear process, MLE, regression with a linear process, second-order risk, SLSE.

where $y = [y_1, \ldots, y_T]'$, $X = [x_1, \ldots, x_T]'$, and $u = [u_1, \ldots, u_T]'$. Assume that rank$(X) = p$ if $T > p$ for simplicity of discussion. The covariance structure $V(\theta)$ does not generally satisfy the condition (Mitra and Rao (1969)) for which the simple least squares estimator (SLSE) $\hat\beta = (X'X)^{-1}X'y$ is equivalent to the best linear unbiased estimator (BLUE)

(1.5) $$\tilde\beta_W = \{X'V^{-1}(\theta)X\}^{-1}X'V^{-1}(\theta)y.$$
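To make the contrast between the SLSE and the BLUE in (1.5) concrete, the following minimal sketch treats a special case that is an assumption of the example, not of the paper: a single regressor ($p = 1$) and first-order autoregressive errors with known parameter `theta`. It uses the standard fact that the inverse covariance matrix of a stationary AR(1) series is, up to the factor $\sigma^{-2}$ (which cancels in (1.5)), tridiagonal with diagonal $(1, 1+\theta^2, \ldots, 1+\theta^2, 1)$ and off-diagonal entries $-\theta$. All function names are hypothetical.

```python
def ar1_precision_times(v, theta):
    """Return P v for the AR(1) inverse covariance P (up to scale), len(v) >= 2.

    P is tridiagonal: diagonal (1, 1 + theta^2, ..., 1 + theta^2, 1),
    off-diagonal entries -theta.
    """
    T = len(v)
    out = []
    for t in range(T):
        diag = 1.0 if t in (0, T - 1) else 1.0 + theta * theta
        s = diag * v[t]
        if t > 0:
            s -= theta * v[t - 1]
        if t < T - 1:
            s -= theta * v[t + 1]
        out.append(s)
    return out


def blue_scalar(x, y, theta):
    """(x' V^{-1} x)^{-1} x' V^{-1} y for one regressor and AR(1) errors."""
    px = ar1_precision_times(x, theta)
    num = sum(a * b for a, b in zip(px, y))
    den = sum(a * b for a, b in zip(px, x))
    return num / den


def slse_scalar(x, y):
    """(x' x)^{-1} x' y, the simple least squares estimator."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Hypothetical data: a linear trend regressor and noisy responses.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(slse_scalar(x, y), blue_scalar(x, y, 0.5))
```

With `theta = 0` the two estimators coincide; a GLSE replaces the unknown `theta` in `blue_scalar` by an estimate computed from the least squares residuals.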

Since $\theta$ is unknown, alternative estimators other than SLSE are the generalized least squares estimator (GLSE) and the maximum likelihood estimator (MLE). When $\{u_t\}$ is a stationary autoregressive process with parameter $\theta$, Toyooka (1985) proved that the maximum likelihood estimator (MLE) $\hat\beta_{ML}$ is a GLSE of the form

$$\hat\beta_{W_1} = \{X'V^{-1}(\hat\theta_1(\hat u))X\}^{-1}X'V^{-1}(\hat\theta_1(\hat u))y \quad (= \hat\beta_{ML}),$$

where $\hat\theta_1(\hat u)$ is some function of $\hat u = [I - X(X'X)^{-1}X']u$. The risk matrix of a usual GLSE

$$\hat\beta_{W_2} = \{X'V^{-1}(\hat\theta_2(\hat u))X\}^{-1}X'V^{-1}(\hat\theta_2(\hat u))y,$$

where

$$\hat\theta_2(\hat u) = \sum_{t=2}^T \hat u_t\hat u_{t-1}\Big/\sum_{t=2}^T \hat u_{t-1}^2,$$

was shown in the previous paper to be equivalent to that of $\hat\beta_{W_1}$ up to $O(T^{-2})$. Moreover, Toyooka gave sufficient conditions for the estimation effect of $\theta$ to vanish from the expansion of the risk matrix of $\hat\beta_{W_2}$ up to $O(T^{-2})$.

In the present paper we examine the above sufficient condition under the more general error process (1.2). In Section 2, we formulate the problem. We give the second-order expansion of the risk matrix of GLSE or MLE and the sufficient condition for the estimation effect of $\theta$ contained in this expansion to vanish in Section 3. In Section 4 we give the statistical implication of the sufficient condition. An extension to the case where $\theta$ is multidimensional is straightforward, and therefore is omitted.


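2. The estimator of the structural parameter $\theta$. Let the estimated residual be

$$\hat u = y - X\hat\beta = \{I - X(X'X)^{-1}X'\}u.$$

We use the Whittle functional for $\hat u$ (see Walker (1964)), that is,

(2.1) $$U_T(\hat u, \theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\Big|\sum_{t=1}^T \hat u_t e^{it\lambda}\Big|^2 \{g(\lambda, \theta)\}^{-1}\,d\lambda$$

(2.2) $$= T\sum_{s=-(T-1)}^{T-1} a_s(\theta)\hat C_s,$$

where

$$a_s(\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{is\lambda}\{g(\lambda, \theta)\}^{-1}\,d\lambda$$

and

(2.3) $$\hat C_s = \sum_{t=1}^{T-|s|}\hat u_t\hat u_{t+|s|}\Big/T.$$

As the estimator $\hat\theta$ of $\theta$, we use the value of $\theta$ that minimizes $U_T(\hat u, \theta)$ with respect to $\theta$.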
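As an illustration of (2.1)-(2.3), consider the first-order autoregressive case $u_t = \theta u_{t-1} + \varepsilon_t$, which is an assumption made only for this example. Then $\{g(\lambda, \theta)\}^{-1} = |1 - \theta e^{i\lambda}|^2 = 1 + \theta^2 - 2\theta\cos\lambda$, so that $a_0(\theta) = 1 + \theta^2$, $a_{\pm 1}(\theta) = -\theta$ and all other $a_s(\theta)$ vanish. The functional (2.2) reduces to $T\{(1 + \theta^2)\hat C_0 - 2\theta\hat C_1\}$, whose minimizer is $\hat C_1/\hat C_0$. The sketch below, with hypothetical function names and a hypothetical residual vector, computes these quantities.

```python
def sample_autocov(u, s):
    """C_s = sum_{t=1}^{T-|s|} u_t u_{t+|s|} / T, as in (2.3)."""
    T = len(u)
    s = abs(s)
    return sum(u[t] * u[t + s] for t in range(T - s)) / T


def whittle_ar1(u, theta):
    """Whittle functional (2.2) specialized to an AR(1) error process."""
    T = len(u)
    c0 = sample_autocov(u, 0)
    c1 = sample_autocov(u, 1)
    return T * ((1.0 + theta * theta) * c0 - 2.0 * theta * c1)


def whittle_ar1_minimizer(u):
    """Closed-form minimizer of whittle_ar1 in theta: C_1 / C_0."""
    return sample_autocov(u, 1) / sample_autocov(u, 0)


# Hypothetical least squares residuals.
residuals = [1.0, 0.5, -0.2, 0.3, 0.8]
theta_hat = whittle_ar1_minimizer(residuals)
print(theta_hat, whittle_ar1(residuals, theta_hat))
```

Since only products $\hat u_t\hat u_{t+|s|}$ enter, replacing the residual vector by its negative leaves the minimizer unchanged.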
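For the linear process $\{u_t\}$ defined in (1.2), the covariance matrix of $u$ is

(2.4) $$V(\theta) = \Big[\int_{-\pi}^{\pi} e^{i(t-s)\lambda}f_\theta(\lambda)\,d\lambda\Big]_{t,s=1,\ldots,T}.$$

So our GLSE is

(2.5) $$\hat\beta_W = \{X'V^{-1}(\hat\theta)X\}^{-1}X'V^{-1}(\hat\theta)y.$$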
First we get

LEMMA 2.1. The estimator $\hat\theta$ for the structural parameter is an even function of $\hat u$.

PROOF. From (2.2), the normal equation is

$$\frac{\partial}{\partial\theta}U_T(\hat u, \theta) = T\sum_{s=-(T-1)}^{T-1}\frac{\partial}{\partial\theta}a_s(\theta)\hat C_s = 0.$$

Then from the implicit function theorem, $\hat\theta$ is a function of $\hat C_s$ $(s = -(T-1), \ldots, (T-1))$, which implies that $\hat\theta$ is an even function of $\hat u$. □

Moreover,

LEMMA 2.2. The GLSE $\hat\beta_W$ is an unbiased estimator for $\beta$.

PROOF. Since $V(\theta)$ is a continuous function of $\theta$ and $\hat\theta$ is an even function of $\hat u$, $\hat\beta_W - \beta$ is an odd function of $u$. Therefore, $E(\hat\beta_W) = \beta$. □

An interesting lemma by Kariya and Toyooka (1985) is

LEMMA 2.3. For the GLSE $\hat\beta_W$,

$$E[(\hat\beta_W - \tilde\beta_W)(\tilde\beta_W - \beta)'] = 0,$$

where $\tilde\beta_W = \{X'V^{-1}(\theta)X\}^{-1}X'V^{-1}(\theta)y$ is BLUE.


From Lemmas 2.2 and 2.3,

(2.6) $$E[(\hat\beta_W - \beta)(\hat\beta_W - \beta)'] = \mathrm{Cov}(\hat\beta_W) = \mathrm{Cov}(\tilde\beta_W) + E[(\hat\beta_W - \tilde\beta_W)(\hat\beta_W - \tilde\beta_W)'].$$

The first term is the covariance matrix of the BLUE and the second term is the estimation effect of $\theta$.


3. Asymptotic evaluation of $E[(\hat\beta_W - \tilde\beta_W)(\hat\beta_W - \tilde\beta_W)']$. One main object is to evaluate the leading term of the second term of (2.6). We consider the situation in which $\{x_t\}$ is a sequence of bounded design functions of $t$.

Let, for $i, j = 1, \ldots, p$,

$$a_{ij}^T(h) = \sum_{t=1}^{T-h} x_{t+h,i}\,x_{t,j}, \quad h = 0, 1, \ldots,$$
$$\phantom{a_{ij}^T(h)} = \sum_{t=1-h}^{T} x_{t+h,i}\,x_{t,j}, \quad h = 0, -1, \ldots.$$

We impose the following regularity conditions on the regression functions $\{x_t\}$ (see Grenander (1954)):

R.1 $a_{ii}^T(0) = \|x_i\|_T^2 \to \infty$ as $T \to \infty$, where $\|x_i\|_T = (\sum_{t=1}^T x_{ti}^2)^{1/2}$ for $i = 1, \ldots, p$.
R.2 $\lim_{T\to\infty} x_{T+1,i}^2/a_{ii}^T(0) = 0$ for $i = 1, \ldots, p$.
R.3 The limit of $a_{ij}^T(h)/T = r_{ij}^T(h)$ as $T \to \infty$ exists for every $i, j = 1, \ldots, p$ and $h = 0, \pm 1, \ldots$. Let $\lim_{T\to\infty} r_{ij}^T(h) = \rho_{ij}(h)$ for $i, j = 1, \ldots, p$ and $h = 0, \pm 1, \ldots$ and let $R(h) = [\rho_{ij}(h)]$.
R.4 $R(0)$ is nonsingular.

Then under these conditions, there exists a matrix-valued regression spectral measure $M(\lambda)$ such that

(3.1) $$R(h) = \int_{-\pi}^{\pi} e^{ih\lambda}\,dM(\lambda).$$
From Walker’s result (1964), we have the following: 


LEMMA 3.1. If ETZ p_(9/00)a,(0) = o(TVT), then as T > œw, YT (Ê — 0) 
—> p N (0, w™*), where D denotes convergence in distribution and 


EPI aA ak 
-Z f (Freger,0)} : 
















Proor. Let 
1 2 
ERA tAt 
U,(u, 8) eae E we {g(d, 0)} dA 
T-1 
= T 2 a,(8)C,, 
s= -(T-1) 
where 
T—|s| 
= pa UU esis /T 
and let 
r 2 
Up(ü, 0) = — f" | X e™| {a(d,8)} d 
7| to} 
=T L a,(9)C,, 
s= —(T-1) 
where 
T—|s| 


Let 6 be the minimizing value of U,(u, 0) and 6 be the minimizing value of 
Ur(ŭ, 0). Then 


T-1 
T È a,(8)C, = 0 
a= —-(T—1) a0 
and 
T-1 3 
T 2 a )Č, =0. 
s= -—(T—-1) 20 
Therefore 
T 1 82 JT os f] 
— —-a,(6*) 6-8)= —a,(9)C, 
s= ~(T- ie A( a VT s+ —(T-1) 06 
where 0* = 4,0 + (1 — ~ and 
T taco r(o- 0) ere 
~ -—a, (0**) 6) =—= a, 6)G,, 
s=—(T-1) T 7a 7 s=- r- 98 




where 6** = \,6 + (1 — A,)6). From Walker’s result, 
1 T-1 2 T-1 g2 


=pim= D a (0**) 
T s={T-1) 06? 
= 2o07w. 
Remark that 
C, ay Č, = 0,(1/T ) 
and 
1 T-1 ð 1 T-1 f] 
a 6 Č = —a (8 Č, ~ C, 
yT a i gel ) DES FT hae )( ) 
trigo a (8 1c: 


From this, if E7} (p_(0/90)a,(8) = a then 


1 T~1 


a,(9)C, ~ a,(8)C,. 


WF h. 5 ae 7 2 6" 


Therefore, from Walker’s result for the asymptotic normality of the right-hand 
side, 
YT (6 — 6) ~ N(0,1/w). o 
From the fact that the parameters $\beta$ and $\theta$ are orthogonal in the Fisher information matrix sense, we can get, by using a discussion similar to Hildreth (1969),

LEMMA 3.2. AsT > œ, 


T'7(6 -60) 
TOUAXV-(O)u 


ð 
TAX V-l(@)u 
wo) 0 e 
0 lim TX’ V (0)X im T X's, “v-(@)x 
T=% T— œ 
8 ð 
0 Him T1X'—V"1l(6)X im mix voveo 1(0)X 
gaV (OX ji Txa VOVO) gO) 


T>œw 


By using these results and applying an argument similar to that of Toyooka 
(1985), we obtain 


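LEMMA 3.3. The second term of (2.6) is

(3.2) $$E[(\hat\beta_W - \tilde\beta_W)(\hat\beta_W - \tilde\beta_W)'] = T^{-2}w^{-1}\lim_{T\to\infty} T\{X'V^{-1}(\theta)X\}^{-1}$$
$$\times\Big[\lim_{T\to\infty} T^{-1}X'\frac{\partial}{\partial\theta}V^{-1}(\theta)\,V(\theta)\,\frac{\partial}{\partial\theta}V^{-1}(\theta)X$$
$$\quad - \lim_{T\to\infty} T^{-1}X'\frac{\partial}{\partial\theta}V^{-1}(\theta)X\ \lim_{T\to\infty} T\{X'V^{-1}(\theta)X\}^{-1}\ \lim_{T\to\infty} T^{-1}X'\frac{\partial}{\partial\theta}V^{-1}(\theta)X\Big]$$
$$\times\lim_{T\to\infty} T\{X'V^{-1}(\theta)X\}^{-1} + o(T^{-2}).$$

We remark that

$$\frac{\partial}{\partial\theta}V^{-1}(\theta) = -V^{-1}(\theta)\frac{\partial}{\partial\theta}V(\theta)V^{-1}(\theta)$$

and

$$\frac{\partial^2}{\partial\theta^2}V^{-1}(\theta) = 2V^{-1}(\theta)\frac{\partial}{\partial\theta}V(\theta)V^{-1}(\theta)\frac{\partial}{\partial\theta}V(\theta)V^{-1}(\theta) - V^{-1}(\theta)\frac{\partial^2}{\partial\theta^2}V(\theta)V^{-1}(\theta).$$

Then

(3.3) $$X'\frac{\partial}{\partial\theta}V^{-1}(\theta)\,V(\theta)\,\frac{\partial}{\partial\theta}V^{-1}(\theta)X = \frac12 X'\frac{\partial^2}{\partial\theta^2}V^{-1}(\theta)X + \frac12 X'V^{-1}(\theta)\frac{\partial^2}{\partial\theta^2}V(\theta)V^{-1}(\theta)X.$$

The limiting behaviour of the first term of (3.3) is evaluated as

(3.4) $$\lim_{T\to\infty} T^{-1}X'\frac{\partial^2}{\partial\theta^2}V^{-1}(\theta)X = \int_{-\pi}^{\pi}\{-f_\theta''(\lambda)f_\theta(\lambda)^{-2} + 2f_\theta'(\lambda)^2 f_\theta(\lambda)^{-3}\}\,dM(\lambda),$$

where $f_\theta''(\lambda) = (\partial^2/\partial\theta^2)f_\theta(\lambda)$ and $f_\theta'(\lambda) = (\partial/\partial\theta)f_\theta(\lambda)$. This expression can be simplified when we assume that $M(\lambda)$ has only one jump point at $\lambda = 0$ for which the jump is $\Delta M(0) = R$. Then (3.4) is

$$-f_\theta''(0)f_\theta(0)^{-2}\Delta M(0) + 2f_\theta'(0)^2 f_\theta(0)^{-3}\Delta M(0).$$

On the other hand,

(3.5) $$\lim_{T\to\infty} T^{-1}X'\frac{\partial}{\partial\theta}V^{-1}(\theta)X\ \lim_{T\to\infty} T\{X'V^{-1}(\theta)X\}^{-1}\ \lim_{T\to\infty} T^{-1}X'\frac{\partial}{\partial\theta}V^{-1}(\theta)X$$
$$= \Big(\int_{-\pi}^{\pi} f_\theta'(\lambda)f_\theta(\lambda)^{-2}\,dM(\lambda)\Big)\Big(\int_{-\pi}^{\pi} f_\theta(\lambda)^{-1}\,dM(\lambda)\Big)^{-1}\Big(\int_{-\pi}^{\pi} f_\theta'(\lambda)f_\theta(\lambda)^{-2}\,dM(\lambda)\Big)$$
$$= f_\theta'(0)^2 f_\theta(0)^{-3}\Delta M(0)$$

under the condition $M(\lambda)$ has only one jump point at $\lambda = 0$. Therefore, the term in square brackets in (3.2) is, from (3.4) and (3.5),

(3.6) $$\lim_{T\to\infty} T^{-1}X'\frac{\partial}{\partial\theta}V^{-1}(\theta)\,V(\theta)\,\frac{\partial}{\partial\theta}V^{-1}(\theta)X - \lim_{T\to\infty} T^{-1}X'\frac{\partial}{\partial\theta}V^{-1}(\theta)X\ \lim_{T\to\infty} T\{X'V^{-1}(\theta)X\}^{-1}\ \lim_{T\to\infty} T^{-1}X'\frac{\partial}{\partial\theta}V^{-1}(\theta)X$$
$$= -\frac12 f_\theta''(0)f_\theta(0)^{-2}\Delta M(0) + f_\theta'(0)^2 f_\theta(0)^{-3}\Delta M(0) + \frac12\lim_{T\to\infty} T^{-1}X'V^{-1}(\theta)\frac{\partial^2}{\partial\theta^2}V(\theta)V^{-1}(\theta)X - f_\theta'(0)^2 f_\theta(0)^{-3}\Delta M(0)$$
$$= \frac12\Big[\lim_{T\to\infty} T^{-1}X'V^{-1}(\theta)\frac{\partial^2}{\partial\theta^2}V(\theta)V^{-1}(\theta)X - f_\theta''(0)f_\theta(0)^{-2}\Delta M(0)\Big].$$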


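In order to evaluate $\lim_{T\to\infty} T^{-1}X'V^{-1}(\theta)(\partial^2/\partial\theta^2)V(\theta)V^{-1}(\theta)X$, let

$$b_{ij}^T(h) = \sum_{t=1}^{T-h} z_{t+h,i}\,z_{t,j}, \quad h = 0, 1, \ldots,$$
$$\phantom{b_{ij}^T(h)} = \sum_{t=1-h}^{T} z_{t+h,i}\,z_{t,j}, \quad h = 0, -1, \ldots,$$

with $Z = V^{-1}(\theta)X$. We assume the following:

S.1 $b_{ii}^T(0) = \|z_i\|_T^2 \to \infty$ as $T \to \infty$.
S.2 $\lim_{T\to\infty} z_{T+1,i}^2/b_{ii}^T(0) = 0$ for $i = 1, \ldots, p$.
S.3 The limit of $b_{ij}^T(h)/T = q_{\theta,ij}^T(h)$ as $T \to \infty$ exists for every $i, j = 1, \ldots, p$ and $h = 0, \pm 1, \ldots$. Let $\lim_{T\to\infty} q_{\theta,ij}^T(h) = q_{\theta,ij}(h)$ for $i, j = 1, \ldots, p$ and $h = 0, \pm 1, \ldots$ and let $Q_\theta(h) = [q_{\theta,ij}(h)]$.
S.4 $Q_\theta(0)$ is nonsingular.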


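Then there exists another regression spectral measure $N_\theta(\lambda)$ such that

$$Q_\theta(h) = \int_{-\pi}^{\pi} e^{ih\lambda}\,dN_\theta(\lambda).$$

So

(3.7) $$\lim_{T\to\infty} T^{-1}X'V^{-1}(\theta)\frac{\partial^2}{\partial\theta^2}V(\theta)V^{-1}(\theta)X = \int_{-\pi}^{\pi} f_\theta''(\lambda)\,dN_\theta(\lambda).$$

Therefore, we obtain the following integral representation.

THEOREM 3.1. If the conditions R.1-R.4 and S.1-S.4 hold, then as $T \to \infty$

(3.8) $$E[(\hat\beta_W - \tilde\beta_W)(\hat\beta_W - \tilde\beta_W)'] = T^{-2}w^{-1}\Big\{\int_{-\pi}^{\pi} f_\theta(\lambda)^{-1}\,dM(\lambda)\Big\}^{-1}$$
$$\times\Big[\int_{-\pi}^{\pi} f_\theta'(\lambda)^2 f_\theta(\lambda)^{-3}\,dM(\lambda) - \frac12\int_{-\pi}^{\pi} f_\theta''(\lambda)f_\theta(\lambda)^{-2}\,dM(\lambda) + \frac12\int_{-\pi}^{\pi} f_\theta''(\lambda)\,dN_\theta(\lambda)$$
$$\quad - \Big(\int_{-\pi}^{\pi} f_\theta'(\lambda)f_\theta(\lambda)^{-2}\,dM(\lambda)\Big)\Big(\int_{-\pi}^{\pi} f_\theta(\lambda)^{-1}\,dM(\lambda)\Big)^{-1}\Big(\int_{-\pi}^{\pi} f_\theta'(\lambda)f_\theta(\lambda)^{-2}\,dM(\lambda)\Big)\Big]$$
$$\times\Big\{\int_{-\pi}^{\pi} f_\theta(\lambda)^{-1}\,dM(\lambda)\Big\}^{-1} + o(1/T^2).$$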

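If the regression function $\{x_t\}$ satisfies the condition that $M(\lambda)$ jumps at $\lambda = 0$ only, we obtain

THEOREM 3.2. As $T \to \infty$,

$$E[(\hat\beta_W - \tilde\beta_W)(\hat\beta_W - \tilde\beta_W)'] = T^{-2}w^{-1}f_\theta(0)^2\,\Delta M(0)^{-1}\,2^{-1}\Big[\int_{-\pi}^{\pi} f_\theta''(\lambda)\,dN_\theta(\lambda) - f_\theta''(0)f_\theta(0)^{-2}\Delta M(0)\Big]\Delta M(0)^{-1} + o(1/T^2)$$

under the condition that $M(\lambda)$ has only one jump point at $\lambda = 0$.

On the other hand,

LEMMA 3.4. If $N_\theta(\lambda)$ has only one jump point at $\lambda = 0$ for which the jump is $f_\theta(0)^{-2}R$, then (3.7) is

$$f_\theta''(0)f_\theta(0)^{-2}R.$$

Moreover,

$$Q_\theta(0) = f_\theta(0)^{-2}R.$$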


By using this lemma, we obtain

THEOREM 3.3. Under the conditions R.1-R.4 and S.1-S.4, if $M(\lambda)$ and $N_\theta(\lambda)$ each have only one jump point at $\lambda = 0$, where the jumps are $R$ and $f_\theta(0)^{-2}R$, respectively, then the coefficient of $T^{-2}$ in (3.2) vanishes.

COROLLARY 3.1. Under the same conditions as in Theorem 3.2,

$$E[(\hat\beta_W - \beta)(\hat\beta_W - \beta)'] - \mathrm{Cov}(\tilde\beta_W) = o(1/T^2) \quad \text{as } T \to \infty.$$

REMARK 1. Since $y - X\hat\beta_W = [I - X\{X'V^{-1}(\hat\theta)X\}^{-1}X'V^{-1}(\hat\theta)]\hat u$, which depends on $y$ only through $\hat u$, $\hat\theta_{MLE}$ is a function of $\hat u$ only. So $\hat\beta_{MLE}$ is a GLSE even in the present situation and $\hat\theta_{MLE}$ has the same asymptotic distribution as $\hat\theta$. So (3.2) for $\hat\beta_{MLE}$ is identical to that for $\hat\beta_W$.

REMARK 2. From the proof of Theorem 3.1, the leading term of $E[(\hat\beta_W - \tilde\beta_W)(\hat\beta_W - \tilde\beta_W)']$ has the same expansion (3.8) whenever $\hat\theta$ is an estimator of a class of best asymptotically normal estimators. This point is also discussed in the prediction framework (see Toyooka (1982)).

REMARK 3. Under the conditions of Theorem 3.2, there is no difference between the asymptotic value of the covariance matrix of $\hat\beta$ and that of $\hat\beta_W$, which is equivalent to that of $\tilde\beta_W$. This fact is a special statement of Grenander's (1954) result. Moreover, under the conditions of Theorem 3.2, there is no difference between the asymptotic value of the covariance matrix of $\hat\beta_W$ and that of $\tilde\beta_W$ up to $O(T^{-2})$. Of course both matrices are smaller than that of $\hat\beta$ up to $O(T^{-2})$ (see Toyooka (1985)).


4. Implications of Theorems 3.2 and 3.3. Under the conditions of Theorem 3.3, we can compare the risk matrix of $\hat\beta$ with that of $\hat\beta_W$ or $\hat\beta_{MLE}$ as stated in Toyooka (1985).

In the first-order autoregression with the autoregressive parameter $\theta$, after simple calculation,

$$Q_\theta(h) = (1 - 4\theta + 6\theta^2 - 4\theta^3 + \theta^4)R = \int_{-\pi}^{\pi} e^{ih\lambda}\,dN_\theta(\lambda),$$

which does not depend on $h$. Therefore, $N_\theta(\lambda)$ has only one jump point at $\lambda = 0$ and the jump is

$$f_\theta(0)^{-2}R = (1 - 4\theta + 6\theta^2 - 4\theta^3 + \theta^4)R.$$

So the structure of the autoregression automatically satisfies the condition for $N_\theta(\lambda)$.
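The "simple calculation" can be traced numerically in the simplest setting. The sketch below assumes a constant regressor $x_t = 1$ (so that $R = 1$) and unit innovation variance, neither of which is stated in the text, and uses the standard tridiagonal form of the AR(1) inverse covariance matrix. It computes $z = V^{-1}(\theta)x$ and the normalized lagged sums $b^T(h)/T$, which should be close to $(1 - \theta)^4 = 1 - 4\theta + 6\theta^2 - 4\theta^3 + \theta^4$ for every fixed lag $h$. Function names are hypothetical.

```python
def ar1_precision_times_ones(T, theta):
    """z = V^{-1} x for x_t = 1 and AR(1) errors with unit innovation variance.

    Row sums of the tridiagonal inverse covariance matrix: (1 - theta)^2 in the
    interior and 1 - theta at the two end points (requires T >= 2).
    """
    z = []
    for t in range(T):
        s = 1.0 if t in (0, T - 1) else 1.0 + theta * theta
        if t > 0:
            s -= theta
        if t < T - 1:
            s -= theta
        z.append(s)
    return z


def b_over_T(z, h):
    """b^T(h) / T = sum_t z_{t+|h|} z_t / T for a single regressor."""
    T = len(z)
    h = abs(h)
    return sum(z[t + h] * z[t] for t in range(T - h)) / T


def quartic(theta):
    """The polynomial 1 - 4 theta + 6 theta^2 - 4 theta^3 + theta^4."""
    return 1 - 4 * theta + 6 * theta**2 - 4 * theta**3 + theta**4


theta = 0.5
z = ar1_precision_times_ones(2000, theta)
print([round(b_over_T(z, h), 5) for h in (0, 1, 5)], quartic(theta))
```

Only the two end points of $z$ differ from $(1 - \theta)^2$, so their contribution to $b^T(h)/T$ disappears as $T$ grows, which is why the limit does not depend on $h$.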

The case in which the error $\{u_t\}$ is a second-order autoregression is a special case of our model (1.2). Pantula and Fuller (1985) compare the empirical risk matrices of two estimated generalized least squares estimators for a linear trend model with second-order autoregressive error in a Monte Carlo experiment. The estimator $\hat\beta_{W_1}$ is based on autoregressive parameters estimated by using the ordinary least squares residuals and $\hat\beta_{W_2}$ is based on a bias adjusted estimator for the autoregressive parameters. Their experimental results agree with our theory in that there is little difference between the risk matrix of $\hat\beta_{W_1}$ and that of $\hat\beta_{W_2}$ in most cases. They find that the bias adjustment procedure for the autoregressive estimator is effective in small samples ($n = 25$) for processes with a large positive root. Our theory says that the effect of the bias adjustment procedure in the autoregressive parameters exists in the $o(T^{-2})$ term of expansion (3.8). Their experiment indicates that the high-order asymptotic expansion provides better approximation to the small sample behavior of the estimator in the interior of the parameter space.

From the results of Toyooka (1983), (1985) and the present results, the following facts were elucidated. In the case where the regression function $\{x_t\}$ does not satisfy the Grenander condition that $M(\lambda)$ increases at not more than $p$ values of $\lambda$, $0 \le \lambda \le \pi$, and the sum of the ranks of the increases in $M(\lambda)$ is $p$,

$$\mathrm{Cov}(\hat\beta) - \mathrm{Cov}(\tilde\beta_W) = O(T^{-1})$$

and

$$\mathrm{Cov}(\hat\beta_W) - \mathrm{Cov}(\tilde\beta_W) = O(T^{-2}).$$

On the other hand, under Grenander's condition,

$$\mathrm{Cov}(\hat\beta) - \mathrm{Cov}(\tilde\beta_W) = O(T^{-2}),$$

and, moreover, if $M(\lambda)$ and $N_\theta(\lambda)$ satisfy the condition of Theorem 3.2,

$$\mathrm{Cov}(\hat\beta_W) - \mathrm{Cov}(\tilde\beta_W) = o(T^{-2}).$$

It is not obvious whether the last equality is $O(T^{-3})$ or not. An extension to the case where $\theta$ is multidimensional is straightforward and therefore is omitted.


Acknowledgments. The author would like to express his hearty thanks to 
Professor W. A. Fuller of Iowa State University for insightful discussions. He 
would like to express his thanks to the referees and Associate Editor for 
improving this paper. 


REFERENCES 


GRENANDER, U. (1954). On the estimation of regression coefficients in the case of an autocorrelated disturbance. Ann. Math. Statist. 25 252-272.

HILDRETH, C. (1969). Asymptotic distribution of maximum likelihood estimation in a linear model with autoregressive disturbances. Ann. Math. Statist. 40 583-594.

KARIYA, T. and TOYOOKA, Y. (1985). Nonlinear versions of the Gauss-Markov theorem and GLSE. In Multivariate Analysis VI (P. R. Krishnaiah, ed.) 345-354. Elsevier, New York.

MITRA, S. K. and RAO, C. R. (1969). Conditions for optimality and validity of simple least squares theory. Ann. Math. Statist. 40 1617-1624.

PANTULA, S. G. and FULLER, W. A. (1985). Mean estimation bias in least squares estimation of autoregressive processes. J. Econometrics 27 99-121.

TOYOOKA, Y. (1982). Prediction error in a linear model with estimated parameters. Biometrika 69 453-459.

TOYOOKA, Y. (1983). Second-order structure of mean squared error of generalized least squares estimator adjusted by estimated residuals in a location model with autoregressive error. Math. Japon. 28 163-171.

TOYOOKA, Y. (1985). Second-order risk comparisons of SLSE with GLSE and MLE in regression with serial correlation. J. Multivariate Anal. 17 107-126.

WALKER, A. M. (1964). Asymptotic properties of least-squares estimates of parameters of the spectrum of a stationary nondeterministic time-series. J. Aust. Math. Soc. 4 363-384.


DEPARTMENT OF APPLIED MATHEMATICS 
FACULTY OF ENGINEERING SCIENCE 
OSAKA UNIVERSITY 

TOYONAKA, OSAKA 560 

JAPAN 


The Annals of Statistics 
1986, Vol. 14, No. 3, 1226-1233


BAYESIAN STATISTICAL INFERENCE FOR SAMPLING A 
FINITE POPULATION¹


By ALBERT Y. LO²


SUNY at Buffalo 


Bayesian statistical inference for sampling a finite population is studied by using the Dirichlet-multinomial process as prior. It is shown that if the finite population variables have a Dirichlet-multinomial prior, then the posterior distribution of the unobserved variables given a sample is also Dirichlet-multinomial. If the population size tends to infinity (the sample size is fixed), sampling without replacement from a Dirichlet-multinomial process is equivalent to the iid sampling from a Dirichlet process. If both the population size and sample size tend to infinity, then given a sample, the posterior distribution of the population empirical distribution function converges in distribution to a Brownian bridge. The large-sample Bayes confidence band and interval are given and shown to be equivalent to the usual ones obtained from simple random sampling.


1. Introduction. In an important article on sampling a finite population 
from a Bayes viewpoint, Ericson (1969) showed that if the prior distribution of 
the number of population variables belonging to the $j$th category, $j = 1, \ldots, k$, is
Dirichlet-multinomial (Mosimann (1962)), the posterior distribution of the num- 
ber of unobserved population variables (given the observed ones) belonging to 
each category is also of the same type. The same result was also obtained by 
Hoadley (1969). Scott (1971) proved that the centered and rescaled posterior 
distribution converges to that of a normal random vector for arbitrary priors. In 
this note, we extend these results to the case of arbitrary population variables. 
This is accomplished by using the Dirichlet process introduced by Ferguson 
(1973). In Section 2 we define the Dirichlet-multinomial process, extend the 
Ericson—Hoadley theorem, and show that the Dirichlet process is the limit of the 
Dirichlet-multinomial process. In Section 3, we prove that the large-sample 
posterior distribution of the population empirical distribution, centered and 
rescaled, is a Brownian bridge with a change of time scale. As corollaries, the 
large-sample Bayes confidence band for the population empirical distribution 
function and the large-sample Bayes confidence interval for the population mean 
are obtained. Some difficulties in Binder (1982) are resolved. 


2. The posterior distribution. Let F be a Dirichlet process on a complete 
and separable metric space R with index measure a (see Ferguson (1973)) denoted 


Received July 1984; revised September 1985. 
'This research was supported in part by NSF Grant MCS 81-02523-01. 
2? First draft was done while the author was at Rutgers University. 
AMS 1980 subject classifications. Primary 62G30; secondary 62G05. 
Key words and phrases. Finite population, Dirichlet-multinomial process, prior and posterior 
distributions, limiting posterior distribution, Brownian bridge. 


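by $F \sim \mathcal{D}(\alpha)$, and given $F$ and any positive integer $N$, let $X_1, \ldots, X_N$ be an iid
sample from $F$. The marginal distribution of $X_1, \ldots, X_N$ is symmetric and
depends upon $N$ and $\alpha$. We denote this by $X_1, \ldots, X_N \sim \mathcal{DM}(N; \alpha)$. Let
$\{1, \ldots, N\} = S + \bar S$ where $S \cap \bar S = \emptyset$ and let the number of elements in $S$ be $n$.
Denote $N - n$ by $m$. Let

$$N(\cdot) = \sum_{i=1}^{N} \delta_{X_i}(\cdot) = m(\cdot) + n(\cdot), \tag{2.1}$$

where $m(\cdot) = \sum_{s \in \bar S} \delta_{X_s}(\cdot)$, $n(\cdot) = \sum_{s \in S} \delta_{X_s}(\cdot)$, and $\delta_x$ is a point mass at $x$.
Let $\hat F_n(\cdot) = n^{-1} n(\cdot)$ be the empirical distribution function of $\{X_s, s \in S\}$. Let
$H_N(\cdot) = N^{-1} N(\cdot)$.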


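DEFINITION 2.1. Suppose $X_1, \ldots, X_N \sim \mathcal{DM}(N; \alpha)$. Denote the distribution
of $N(\cdot)$ by $DM(N; \alpha)$.


The conditional distribution of $\{X_s, s \in \bar S\}$ given $\{X_s, s \in S\}$ is characterized
by the following theorem.


THEOREM 2.1. Suppose $X_1, \ldots, X_N \sim \mathcal{DM}(N; \alpha)$. Then given $\{X_s, s \in S\}$,
$\{X_s, s \in \bar S\} \sim \mathcal{DM}(m; \alpha + n(\cdot))$. Hence $m(\cdot) \mid \{X_s, s \in S\} \sim DM(m; \alpha + n(\cdot))$.


PROOF. Since $X_1, \ldots, X_N$ is exchangeable, $\{X_i, i \in \bar S\} \mid \{X_i, i \in S\} =_d \{X_i,
n + 1 \le i \le N\} \mid \{X_i, 1 \le i \le n\}$. Hence, $F \mid \{X_i, 1 \le i \le n\} \sim \mathcal{D}(\alpha + n(\cdot))$
implies $\{X_i, i \in \bar S\} \mid \{X_i, i \in S\} \sim \mathcal{DM}(m; \alpha + n(\cdot))$. $\square$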
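
Theorem 2.1 can be illustrated by simulation. The sketch below assumes an index measure with finite support (the dictionary `alpha`, a hypothetical choice made only for illustration) and uses the standard Polya urn representation of sampling from a Dirichlet process: each new unit is drawn from the current weights, and the drawn value then gains one unit of weight. Starting the urn at $\alpha + n(\cdot)$ gives one draw of the $m$ unobserved units.

```python
import random
from collections import Counter

def sample_unobserved(alpha, observed, m, rng):
    """Draw m unobserved units from a Polya urn started at alpha + n(.)."""
    weights = Counter(alpha)          # index measure alpha as point masses (assumed finite support)
    for x in observed:                # add the counting measure n(.) of the sample
        weights[x] += 1.0
    draws = []
    for _ in range(m):
        values = list(weights)
        x = rng.choices(values, weights=[weights[v] for v in values], k=1)[0]
        draws.append(x)
        weights[x] += 1.0             # urn reinforcement: the drawn value gains unit mass
    return draws

alpha = {0: 0.5, 1: 0.5, 2: 1.0}      # hypothetical index measure, alpha(R) = 2
observed = [2, 2, 1, 2]               # the sampled units {X_s, s in S}, n = 4
rng = random.Random(0)
unobserved = sample_unobserved(alpha, observed, m=6, rng=rng)
m_counts = Counter(unobserved)        # one realisation of m(.) given the sample
```

Repeating the draw many times and tabulating `m_counts` would approximate the posterior $DM(m; \alpha + n(\cdot))$ of $m(\cdot)$.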


The following result gives the posterior distribution of $m(\cdot)$ restricted to a
subset of $R$ and is equivalent to Theorem 5.3 of Hoadley (1969). The proof is
inserted for completeness. Let $B$ be a measurable subset of $R$. For any measure $\mu$
on $R$, let $\mu_B$ be the restriction of $\mu$ to $R - B$.


THEOREM 2.2. Suppose $X_1, \ldots, X_N \sim \mathcal{DM}(N; \alpha)$. Then given $\{X_s, s \in S\}$
and $m(B)$, $m_B(\cdot) \sim DM(m - m(B); \alpha_B + n_B(\cdot))$.


PROOF. Suppose $\{X_s, s \in S\}$ is given. Then $F \sim \mathcal{D}(\alpha + n(\cdot))$, $F_B/F(R - B)
\sim \mathcal{D}(\alpha_B + n_B(\cdot))$, and $\{X_s, s \in \bar S\} \mid F$ are iid $F$. Hence, given $F$ and $m(B)$, the
$X_s$'s with $s \in \bar S$ which do not fall in $B$ are iid $F_B/F(R - B)$. Since $F_B/F(R - B)$
is independent of $m(B)$, $F_B/F(R - B) \mid m(B) \sim \mathcal{D}(\alpha_B + n_B(\cdot))$. The last two
statements imply the result. $\square$


Theorem 2.1 specializes to the Ericson-Hoadley theorem as can be seen as
follows: Let $A_1, \ldots, A_k$ be a partition of $R$, i.e., partition $R$ into $k$ categories.
Then $N(A_j)$, $j = 1, \ldots, k$ is a Dirichlet-multinomial vector with parameters
$N$; $\alpha(A_j)$, $j = 1, \ldots, k$ in the sense of Ericson (1969) and a compound
multinomial vector according to Hoadley (1969). Since given $\{X_s, s \in S\}$,
$\{X_s, s \in \bar S\} \sim \mathcal{DM}(m; \alpha + n(\cdot))$ by Theorem 2.1, the posterior distribution of
$m(A_j)$, $j = 1, \ldots, k$, given $\{X_s, s \in S\}$ is a Dirichlet-multinomial vector with
parameters $m$; $\alpha(A_j) + n(A_j)$, $j = 1, \ldots, k$. The last statement with
$\alpha = \sum_{j=1}^{k} \alpha_j \delta_{y_j}$, where $y_j$, $j = 1, \ldots, k$ are the (distinct) categorical values of the
population variables and $\alpha_j \ge 0$, implies the result of Ericson and Hoadley.

The next result states that sampling without replacement from a finite
population with a Dirichlet-multinomial prior is equivalent to the iid sampling
from a Dirichlet process if the population size is large. Denote convergence in
distribution of random variables by $\to_d$.


PROPOSITION 2.1. Let $M_n \sim DM(n; \alpha)$. Then $n^{-1} M_n \to_d F$ where $F \sim \mathcal{D}(\alpha)$.


PROOF. Let $F \sim \mathcal{D}(\alpha)$ and given $F$, $Y_1, \ldots, Y_n$ are iid $F$. Then $\sum_{i=1}^{n} \delta_{Y_i} \sim
DM(n; \alpha)$. Let $H_n = n^{-1} \sum_{i=1}^{n} \delta_{Y_i}$. For each (measurable) subset $A$ of $R$ and each $F$,
$P\{H_n(A) \to F(A) \mid F\} = 1$. By Fubini's theorem $H_n(A) \to F(A)$ a.s. and similarly,
$H_n(A_j)$, $j = 1, \ldots, k \to F(A_j)$, $j = 1, \ldots, k$ a.s. for each collection
$\{A_j, j = 1, \ldots, k\}$. It follows that $H_n(A_j)$, $j = 1, \ldots, k \to_d F(A_j)$, $j = 1, \ldots, k$.
Since $R$ is a complete and separable metric space, Theorem 3.17 in Matthes,
Kerstan, and Mecke (1978) implies the distribution of $H_n$ converges to that of $F$.
$\square$


The above proposition together with Theorem 2.1 can be used to obtain some
Bayes estimates of Ferguson (1973). For example, since $m(\cdot) \mid \{X_s, s \in S\} \sim
DM(m; \alpha + n(\cdot))$, by Proposition 2.1 and for a fixed sample size $n$,

$$E[H_N(A) \mid X_s, s \in S] \to E[F(A) \mid X_s, s \in S], \tag{2.2}$$

and if $\int |x| \alpha(dx) < \infty$,

$$E\Big[\int x H_N(dx) \,\Big|\, X_s, s \in S\Big] \to E\Big[\int x F(dx) \,\Big|\, X_s, s \in S\Big], \tag{2.3}$$

where $F$ given $\{X_s, s \in S\}$ is a Dirichlet process with index measure $\alpha + n(\cdot)$.
Hence,

$$E[F(A) \mid X_s, s \in S] = \frac{\alpha(A)}{\alpha(R) + n} + \frac{n}{\alpha(R) + n} \hat F_n(A), \tag{2.4}$$

and if $\int |x| \alpha(dx) < \infty$,

$$E\Big[\int x F(dx) \,\Big|\, X_s, s \in S\Big] = \frac{1}{\alpha(R) + n} \int x \alpha(dx) + \frac{n}{\alpha(R) + n} \bar X, \tag{2.5}$$

where $\bar X$ is the sample average. If the $X$'s can only take values in a finite set,
Ericson (1969, pages 212 and 213) established the above results by computing the
posterior means of functionals of $H_N$ first and then letting $N \to \infty$ for a fixed $n$.
His expressions for the posterior means are applicable in our case as well (with
obvious change of notation) and will not be reproduced.


3. The large-sample posterior distribution. Our next result deals with
the large-sample posterior distribution of $H_N$. In the following, we assume that $R$
is a finite $q$-dimensional cube in a Euclidean space, say $R = [0,1]^q$. For brevity
we denote $\mu((0, x])$ by $\mu(x)$ for any measure $\mu$ on $[0,1]^q$ and for each $x \in [0,1]^q$.
We first compute the posterior mean and variance of $H_N(x) = N^{-1} \sum_{i=1}^{N} I\{X_i \le x\}$
where $\le$ is the coordinatewise inequality.

Since $N(\cdot) \sim DM(N; \alpha)$ and $H_N(\cdot) = N^{-1} N(\cdot)$, from Mosimann (1962) or
Ericson (1969) we find

$$E[H_N(x) \mid X_s, s \in S] = \frac{1}{N(\alpha(R) + n)} \Big[(\alpha(R) + N) \sum_{s \in S} \delta_{X_s}(x) + m\,\alpha(x)\Big] \tag{3.1}$$

and

$$\operatorname{Var}[H_N(x) \mid X_s, s \in S] = \Big(\frac{m}{N^2}\Big) \Big[\frac{\alpha(x) + \sum_{s \in S} \delta_{X_s}(x)}{\alpha(R) + n}\Big] \Big[\frac{\alpha(R) - \alpha(x) + n - \sum_{s \in S} \delta_{X_s}(x)}{\alpha(R) + n}\Big] \Big[\frac{N + \alpha(R)}{1 + n + \alpha(R)}\Big]. \tag{3.2}$$

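Let us denote the centered (at $\hat F_n$) and rescaled $H_N$ by $Y_{mn}$, i.e.,

$$Y_{mn}(x) = \Big(\frac{nN}{m}\Big)^{1/2} \big(H_N(x) - \hat F_n(x)\big) = \Big(\frac{mn}{N}\Big)^{1/2} \big\{m^{-1} m(x) - \hat F_n(x)\big\} = \tilde Y_{mn}(x) + \varepsilon_{mn}(x), \tag{3.3}$$

where

$$\tilde Y_{mn}(x) = \Big(\frac{mn}{N}\Big)^{1/2} \big\{m^{-1} m(x) - \tilde F_n(x)\big\}, \qquad \varepsilon_{mn}(x) = \Big(\frac{mn}{N}\Big)^{1/2} \big(\tilde F_n(x) - \hat F_n(x)\big), \qquad \tilde F_n(x) = \frac{\alpha(x)}{\alpha(R) + n} + \frac{n}{\alpha(R) + n} \hat F_n(x). \tag{3.4}$$

The main result of this section is that $Y_{mn}(\cdot)$ converges in distribution to
a Brownian bridge on $[0,1]^q$. This convergence can be established by showing
that $\tilde Y_{mn}(\cdot)$ converges in distribution to the same limit since $\sup_x |\varepsilon_{mn}(x)| \le
2\alpha(R) n^{-1/2}$. We first establish the tightness of $\tilde Y_{mn}(\cdot)$. Denote weak convergence
of measure by $\Rightarrow$.


LEMMA 3.1. Given $\{X_s, s \in S\}$, the sequence $\{\tilde Y_{mn}\}$ is tight in $D[0,1]^q$ if $F$ is
continuous and $\hat F_n \Rightarrow F$.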


PROOF. Denote $\alpha(R) + N$ by $a_N$ and $\alpha(R) + n$ by $a_n$. We use the fluctuation
inequality of Bickel and Wichura (1971) to show tightness. Let $B$, $C$ be two
neighboring blocks; then

$$E \tilde Y_{mn}^2(B) \tilde Y_{mn}^2(C) = \Big(\frac{n}{mN}\Big)^2 E\Big\{\big[m(B) - m\tilde F_n(B)\big]^2 \times E\big[(m(C) - m\tilde F_n(C))^2 \mid m(B)\big]\Big\}, \tag{3.5}$$

where $E$ denotes conditional expectation given $\{X_s, s \in S\}$. Let $A = (B + C)^c$.
According to Theorem 2.2 (or Theorem 5.3 of Hoadley (1969)), given $m(B)$, the
joint distribution of $m(C)$ and $m(A)$ is a Dirichlet-multinomial vector with
parameters $m - m(B)$, $\alpha(C) + n(C)$, and $\alpha(A) + n(A)$. Therefore the moment expression in
Mosimann (1962, equation (15)) can be applied to obtain


E|(m(C) - mF(C))'m(B)| 


(m — m(B))F,(C) 

















=E (ao) DEE | Im(B) 
BC 2 
+{[m — m(B)] oe -= m0) 
(3.6) , #(c) 1 +a, Ë (C) 
= [mf,(B) - m(ByP a | | 
X F,(C)F,(A) 
+ [m (B) - m(B)|(ay + ™) ETAs C) +a,F(A 40) 
ayml,(C)F,(A) 
l+a,F(A+C)’ 
Hence, 


E{[m(B) — mF,(B)]*[m(C) — mF(C)}”} 
FC) 1 + a,F,(C) 
F(A+C)1+4,F(A+C) 
FA(C)F(A) 
F(A+C)1+a,F(A +C)) 
aym (C) F(A) 
lt+a,F(A+C) 





= E[m(B) — mF,(B)]* 





(3.7) —E[m(B) — mF,(B)|*(ay + m) 


+E[m(B) - mF,(B)]? 


=I + I+II. 


From Johnson and Kotz (1969, page 231), we find 











i an 1 1 
E[m(B) ~ mA (B) = mi, B)E,(A + O| E) || 
13:8) x {(ay + m)(ay + 2m)[1 - 3F(B)R(A + C)] 


+a,(m — 1)[1 + 8ayF(B)F(A + C)]}, 














E[m(B) - mË,(B)]? = mi B)R(A + o| wl l 


a, +t 1l 
(3.9) (2 
x 
a, +2 





JIA) -Ē(A+0)], 


and 





(3.10) E[m(B) — mF(B)]? = mF (B)F(A + o| 7 "y i ). 


Substitute (3.8) into I to obtain 





aym \? ` i 
I< | | F(B)F(C) ifm>2andn> 2. 
a 


n 


Similar substitutions for II and III yield 





2 
I< (2) F(B)#(C) 
and 





Ill < | | ARO. 


Qn 





m\? _ 
I++ < 13/8 l F(B)E(C), 
a, 
implying EY2,(B)¥2,(C) < 18 (1+ a(R))?F(B)F(C), provided m>2 and 
n = 2. By the extended version of Theorem 3 in Bickel and Wichura (1971), 


{Ënn} is tight. O 


Let $B(x)$ be a Gaussian process with zero means and covariance $\operatorname{cov}(B(t),
B(s)) = F(\min(t, s)) - F(t)F(s)$ where the minimum is computed coordinatewise.

THEOREM 3.1. Suppose $F$ is continuous and $\hat F_n \Rightarrow F$. Then given $\{X_s, s \in S\}$,
$Y_{mn}(\cdot) \to_d B(\cdot)$ in $D[0,1]^q$ as $m \to \infty$ and $n \to \infty$.


PROOF. According to Scott (1971), the finite dimensional conditional
distribution of $Y_{mn}(\cdot)$ converges to that of $B(\cdot)$. Hence, the finite dimensional
conditional distribution of $\tilde Y_{mn}(\cdot)$ also converges to that of $B(\cdot)$. An application
of Lemma 3.1 entails $\tilde Y_{mn}(\cdot) \to_d B(\cdot)$, implying $Y_{mn}(\cdot) \to_d B(\cdot)$. $\square$


If $q = 1$, Theorem 3.1 and the continuous mapping theorem can be applied to
give the following corollaries:


COROLLARY 3.1. Under the assumptions of Theorem 3.1,

$$\lim_{m,n \to \infty} P\Big\{\sup_{0 \le t \le 1} |Y_{mn}(t)| > \lambda \,\Big|\, X_s, s \in S\Big\} = 2 \sum_{j=1}^{\infty} (-1)^{j+1} e^{-2j^2\lambda^2}. \tag{3.11}$$


According to Corollary 3.1, a $(1 - \alpha)$ large-sample Bayes confidence band for
$H_N$ is given by

$$\hat F_n \pm \lambda \Big(\frac{1 - f}{n}\Big)^{1/2}, \tag{3.12}$$

where $1 - f = 1 - n/N$ is the finite population correction factor (fpc) and $\lambda$ is
the $(1 - \alpha)$-percentile point of $\sup_s |B(s)|$ defined by $2 \sum_{j=1}^{\infty} (-1)^{j+1} \exp(-2j^2\lambda^2) = \alpha$.
Deleting the fpc, the last band becomes the Kolmogorov-Smirnov band obtained
by Lo (1983) using a Dirichlet prior and iid sampling.
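
The critical value $\lambda$ in (3.12) has no closed form, but it is easy to compute numerically. The sketch below truncates the alternating series in (3.11) and solves for $\lambda$ by bisection; the function names, the truncation at 100 terms, and the bracketing interval are choices made here for illustration only.

```python
import math

def kolmogorov_tail(lam, terms=100):
    """Truncated series 2 * sum_{j>=1} (-1)^(j+1) exp(-2 j^2 lam^2) from (3.11)."""
    return 2.0 * sum((-1) ** (j + 1) * math.exp(-2.0 * j * j * lam * lam)
                     for j in range(1, terms + 1))

def critical_value(level, lo=0.3, hi=5.0, tol=1e-10):
    """Solve kolmogorov_tail(lam) = level by bisection; the tail decreases in lam."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_tail(mid) > level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def band_half_width(level, n, N):
    """Half-width lam * ((1 - f) / n)^(1/2) of the band (3.12), with f = n / N."""
    f = n / N
    return critical_value(level) * math.sqrt((1.0 - f) / n)

lam_05 = critical_value(0.05)
half_width = band_half_width(0.05, n=100, N=1000)
```

For level 0.05 the bisection returns a value close to the familiar Kolmogorov-Smirnov constant 1.358, and the half-width shrinks to zero as $n$ approaches $N$.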


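COROLLARY 3.2. Suppose $\lim_{n \to \infty} (n - 1)^{-1} \sum_{s \in S} (X_s - \bar X)^2 = \sigma^2$. Under the
assumptions of Theorem 3.1,

$$\lim_{m,n \to \infty} P\Big\{\Big(\frac{nN}{m}\Big)^{1/2} \Big|\int x H_N(dx) - \bar X\Big| \le \lambda \,\Big|\, X_s, s \in S\Big\} = (2\pi\sigma^2)^{-1/2} \int_{-\lambda}^{\lambda} \exp\Big(-\frac{1}{2}\Big(\frac{x}{\sigma}\Big)^2\Big) dx. \tag{3.13}$$


Corollary 3.2 partly extends the corollary of Scott (1971), who established the
result for more general priors, but under the restriction that the $X$'s can only
take values in a finite set.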


REMARK 3.1. The conclusion of Corollary 3.2 also appears in Binder (1982, 
Section 2). However, Binder’s argument is based on a result of Scott (1971) and 
does not work: The definition of A, there implies that k grows with the sample 
size whereas Scott’s result is applicable for a fixed k only. 


An immediate consequence of Corollary 3.2 is that a $(1 - \alpha)$ large-sample
confidence interval for the population mean $\int s H_N(ds)$ is given by

$$\bar X \pm \lambda S_n \Big(\frac{1 - f}{n}\Big)^{1/2}, \tag{3.14}$$

where $S_n^2 = (n - 1)^{-1} \sum_{s \in S} (X_s - \bar X)^2$ is the sample variance and $\lambda$ is the
$(1 - \alpha/2)$-percentile point of a standard normal random variable. This
large-sample Bayes confidence interval coincides with the well known large-sample
confidence interval for the population mean from simple random sampling (Cochran
(1977)). Deleting the fpc, the last interval coincides with the usual large-sample
sampling-theory confidence interval for a population mean obtained from iid
sampling.
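
The interval (3.14) is simple to compute. The following sketch uses only the Python standard library; the function name and the toy data are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, variance

def bayes_mean_interval(sample, N, level=0.05):
    """Interval X_bar +/- lam * S_n * ((1 - f) / n)^(1/2) from (3.14)."""
    n = len(sample)
    x_bar = mean(sample)
    s_n = math.sqrt(variance(sample))           # statistics.variance divides by n - 1
    f = n / N                                   # sampling fraction
    lam = NormalDist().inv_cdf(1.0 - level / 2.0)
    half = lam * s_n * math.sqrt((1.0 - f) / n)
    return x_bar - half, x_bar + half

sample = [1, 2, 3, 4, 5]                        # toy data
lo_50, hi_50 = bayes_mean_interval(sample, N=50)
lo_census, hi_census = bayes_mean_interval(sample, N=5)
```

When the whole population is observed ($N = n$) the fpc is zero and the interval collapses to the sample mean, which is then the population mean.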


4. Concluding remarks. The results in the previous sections can be applied 
to stratified population models. The idea is that each stratum can be treated as 
an independent population model and a Dirichlet-multinomial prior can be 
assigned to the population variables of the stratum in question. If the priors for 
different strata are assumed to be independent, the results in the previous 
sections hold for each stratum. These can be applied to justify the large-sample 
result in Binder (1982, Section 2.3). 


Acknowledgments. The author wishes to thank an Associate Editor and a 
referee for several helpful suggestions. 


REFERENCES 


BICKEL, P. J. and WICHURA, M. J. (1971). Convergence criteria for multiparameter stochastic 
processes and some applications. Ann. Math. Statist. 42 1656-1670. 

BINDER, D. A. (1982). Non-parametric Bayesian models for samples from finite populations. J. Roy. 
Statist. Soc. Ser. B 44 388-393. 

COCHRAN, W. G. (1977). Sampling Techniques. Wiley, New York. 

Ericson, W. A. (1969). Subjective Bayesian models in sampling finite populations. J. Roy. Statist. 
Soc. Ser. B 31 195-233. 

FERGUSON, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 
209-230. 

HOADLEY, B. (1969). The compound multinomial distribution and Bayesian analysis of categorical 
data from finite populations. J. Amer. Statist. Assoc. 64 216-229. 

JOHNSON, N. L. and KOTZ, S. (1969). Distributions in Statistics: Discrete Distributions. Wiley, New
York.

Lo, A. Y. (1983). Weak convergence for Dirichlet processes. Sankhya Ser. A 45 105-111. 

MATTHES, K., KERSTAN, J. and MECKE, J. (1978). Infinitely Divisible Point Processes. Akademie- 
Verlag, Berlin/Wiley, New York. 

MOSIMANN, J. E. (1962). On the compound multinomial distribution, the multivariate β-distribution,
and correlations among proportions. Biometrika 49 65-82.

SCOTT, A. (1971). Large-sample posterior distributions for finite populations. Ann. Math. Statist. 42
1113-1117.


DEPARTMENT OF STATISTICS 
STATE UNIVERSITY OF NEW YORK 
BUFFALO, NEW YORK 14226 


The Annals of Statistics
1986, Vol. 14, No. 3, 1234-1239 


THE TOTAL TIME ON TEST PLOT AND THE CUMULATIVE 
TOTAL TIME ON TEST STATISTIC FOR A 
COUNTING PROCESS 


By RICHARD D. GILL


Centrum voor Wiskunde en Informatica, Amsterdam 


Results on the total time on test plot are usually obtained on the 
assumption that the number of events to be observed is fixed in advance. Here 
it is shown that the same large sample results hold when the number of events 
is random if a simple condition is satisfied. 


Consider a univariate counting process $N$ on the time interval $J = [0, \tau)$ or
$[0, \tau] \subset [0, \infty]$ with continuous compensator $A$. Suppose we are interested in
testing the hypothesis: $A = cA_0$ for some unknown constant $c > 0$ and a given
observed process $A_0$. For example, suppose $N$ has intensity process $\Lambda$ with
$\Lambda(t) = \lambda(t) Y(t)$ for some observable process $Y$ and some unknown function $\lambda$,
and take $A_0(t) = \int_0^t Y(s)\,ds$. The hypothesis $A = cA_0$ corresponds to the interesting
hypothesis $\lambda$ = constant. The total time on test plot and the cumulative total
time on test statistic are two common techniques for investigating this hypothesis
when the alternatives of special interest are that $dA/dA_0$ is monotone (i.e., in
our example, $\lambda$ is monotone). They are based on the observation that in the new
time scale measured by $A_0$, $N$ is transformed into the process $N \circ A_0^{-1}$ which,
under the null hypothesis, is a counting process with constant intensity $c$ on the
(random) time interval $[0, A_0(\tau)]$; i.e., a randomly stopped Poisson process with
constant intensity. Under the alternative it is a counting process with monotone
intensity $(dA/dA_0) \circ A_0^{-1}$. (See Aalen and Hoem, 1978.)

The total time on test plot is usually carried out as follows: Choose some
number of events $R$ (in classical applications $R$ is a fixed number $r$ say, but this
is not in general possible¹) such that $T_R \le \tau$ almost surely (here $0 < T_1 < T_2 <
\cdots$ are the jump times of $N$; say $T_i = \tau$ for all $i > N(\tau)$), and make a plot of
$A_0(T_i)/A_0(T_R)$ versus $i/R$, $i = 0, 1, \ldots, R$. Under the null hypothesis the plot
should approximate the straight line $y = x$, $x \in [0,1]$. Under the alternative it
tends to be concave or convex depending on whether $dA/dA_0$ is decreasing or
increasing. The cumulative total time on test statistic is the quantity
$\sum_{i=1}^{R-1} A_0(T_i)/A_0(T_R)$. The standardized version of the statistic corresponds to the
(signed) area between total time on test plot and the line $y = x$ (using an
appropriate interpolation convention between consecutive plotting positions).
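
The plotting positions and the statistic can be computed directly from the values $A_0(T_1), \ldots, A_0(T_R)$. In the sketch below the function names are illustrative, and the standardization uses the classical fixed-$R$ centring $(R - 1)/2$ and variance $(R - 1)/12$ (the mean and variance of a sum of $R - 1$ independent uniform variables); that particular standardization is an assumption of this example rather than something spelled out in the text.

```python
import math

def ttt_plot(a0_at_jumps):
    """Plotting positions (i/R, A_0(T_i)/A_0(T_R)), i = 0, ..., R."""
    R = len(a0_at_jumps)
    total = a0_at_jumps[-1]                     # A_0(T_R)
    xs = [i / R for i in range(R + 1)]
    ys = [0.0] + [a / total for a in a0_at_jumps]
    return xs, ys

def cumulative_ttt(a0_at_jumps):
    """Sum over i = 1, ..., R - 1 of A_0(T_i)/A_0(T_R)."""
    total = a0_at_jumps[-1]
    return sum(a / total for a in a0_at_jumps[:-1])

def standardized_ttt(a0_at_jumps):
    """Centre and scale with the fixed-R null moments (assumed standardization)."""
    R = len(a0_at_jumps)
    return (cumulative_ttt(a0_at_jumps) - (R - 1) / 2.0) / math.sqrt((R - 1) / 12.0)

a0 = [1.0, 2.0, 3.0, 4.0]                       # evenly spaced events in the A_0 time scale
xs, ys = ttt_plot(a0)
v = cumulative_ttt(a0)
z = standardized_ttt(a0)
```

Evenly spaced events put every plotting position on the diagonal $y = x$ and give a standardized value of zero.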


Received March 1986; revised October 1985. 

¹This point is often overlooked; cf. Barlow and Proschan (1969) and many later authors.

AMS 1980 subject classifications. 62M99, 62N05, 62P10. 

Key words and phrases. Total time on test plot, cumulative total time on test statistic, testing
exponentiality, random time change, counting process.




More generally one could think of choosing some random time $T$ (not necessarily
even a stopping time, e.g., the last jump time before some fixed time $t_0$) and
plotting $A_0(t)/A_0(T)$ against $N(t)/N(T)$, $t \in [0, T]$. Taking $T = T_R$ gives the
previous plot with a particular interpolation convention. We shall develop some
asymptotic null-hypothesis distribution theory for this plot. From this,
corresponding results for the statistic follow immediately.

Now if $N(\tau) \ge r$ with probability 1, and we take $T = T_r$, exact distributional
results are available for plot and statistic since $N \circ A_0^{-1}$, stopped at $A_0(T_r)$, is
simply (under the null hypothesis) a Poisson process with constant intensity $c$
stopped at the $r$th event. The plot has the same distribution as the empirical d.f.
based on $r - 1$ i.i.d. uniform $[0,1]$ r.v.'s. Large sample results (i.e., as $r \to \infty$) for
plot and statistic are now immediately available: we have a Brownian bridge and
the signed area beneath a Brownian bridge, respectively.

So we shall consider here the case when $T$ is chosen as arbitrarily as possible,
not even necessarily as a stopping time. Consider a sequence of situations indexed
by $n$, all under the null hypothesis with $A = A^n = cA_0^n$ for a fixed constant $c$.
Our results are obtained under the following simple assumption on the times
$T = T^n$.


ASSUMPTION. Suppose $T^n$ is such that there exists a sequence of constants
$a_n \to \infty$ as $n \to \infty$ such that

$$A_0^n(T^n)/a_n \to_P \alpha \in (0, \infty) \quad \text{as } n \to \infty.$$


Consider the processes $N^n \circ (A_0^n/a_n)^{-1}$. These have intensity $ca_n$ on
$[0, A_0^n(\tau)/a_n]$. At this final time instant $A_0^n(\tau)/a_n$ (if it is finite) start up an
independent Poisson process with constant intensity $ca_n$ and fasten it onto
$N^n \circ (A_0^n/a_n)^{-1}$. In this way we obtain a process $U^n$ say, coinciding with
$N^n \circ (A_0^n/a_n)^{-1}$ on $[0, A_0^n(\tau)/a_n]$, with intensity $ca_n$ on the whole line. So $U^n$ is
a Poisson process and we have easily

$$a_n^{1/2} \Big(\frac{U^n}{a_n} - cI\Big) \to_d c^{1/2} W \quad \text{in } D[0, \infty),$$

where $W$ is a standard Wiener process and $I$ is the identity function $I(x) = x$.
Now look at the process $V^n$ defined by

$$V^n(x) = U^n\Big(\frac{A_0^n(T^n)}{a_n}\, x\Big), \quad x \in [0,1],$$

i.e., on $[0,1]$

$$V^n = N^n \circ \Big(\frac{A_0^n}{a_n}\Big)^{-1} \circ \Big(\frac{A_0^n(T^n)}{a_n} \cdot I\Big) = N^n \circ (A_0^n)^{-1} \circ (A_0^n(T^n) \cdot I)$$

since $A_0^n(T^n) \le A_0^n(\tau)$ a.s.; i.e., we do not run into the appended Poisson process.


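Recall that $A_0^n(T^n)/a_n \to_P \alpha$. Since

$$\frac{U^n}{a_n} - cI \to_P 0,$$

we obtain $N^n(T^n)/a_n \to_P c\alpha$.
By a change of time argument (see Appendix) we have in $D[0,1]$

$$a_n^{1/2} \Big(\frac{U^n}{a_n} - cI\Big) \circ \Big(\frac{A_0^n(T^n)}{a_n} \cdot I\Big) \to_d c^{1/2}\, W \circ (\alpha \cdot I) =_d W(c\alpha \cdot I),$$

i.e.,

$$a_n^{1/2} \Big(\frac{V^n}{a_n} - c\,\frac{A_0^n(T^n)}{a_n} \cdot I\Big) \to_d W(c\alpha \cdot I).$$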


Since $W$ has continuous paths, the process obtained from the process on the
left-hand side by subtracting the straight line connecting its end points (at $x = 0$
and $x = 1$) also converges in distribution; i.e.,

$$a_n^{1/2} \Big(\frac{V^n}{a_n} - \frac{V^n(1)}{a_n} \cdot I\Big) \to_d W(c\alpha \cdot I) - W(c\alpha) \cdot I =_d (c\alpha)^{1/2} B^0,$$

where $B^0$ is a Brownian bridge on $[0,1]$. We have obtained, therefore,

$$(a_n)^{-1/2} \big(N^n \circ (A_0^n)^{-1} \circ (A_0^n(T^n) \cdot I) - N^n(T^n) \cdot I\big) \to_d (c\alpha)^{1/2} B^0,$$

from which it easily follows that

$$N^n(T^n)^{1/2} \Big(\frac{N^n \circ (A_0^n)^{-1} \circ (A_0^n(T^n) \cdot I)}{N^n(T^n)} - I\Big) \to_d B^0.$$

This is the required result since a plot of

$$\frac{N^n \circ (A_0^n)^{-1} \circ (A_0^n(T^n) \cdot x)}{N^n(T^n)} \quad \text{against } x$$

is a plot of $N^n(t)/N^n(T^n)$ against $A_0^n(t)/A_0^n(T^n)$ (replace $x$ by this last
quantity).

Thus the total time on test plot (under the null hypothesis) has the same
asymptotic distribution as the uniform empirical d.f., taking $N^n(T^n)$ as the
number of observations.

We obtain immediately that the asymptotic distribution of the signed area
between total time on test plot and diagonal, times $N^n(T^n)^{1/2}$, is the same as
that of $\int_0^1 B^0\,dx =_d N(0, 1/12)$, which is the required result on the cumulative total
time on test statistic.


EXAMPLES. Suppose there exist $a_n \to \infty$ as $n \to \infty$ such that $P(N^n(\tau) \ge
a_n) \to 1$. Then we can consider the total time on test plot for the first $a_n$ events;
i.e., take $T^n = T^n_{a_n}$. We show how we can recover the classical asymptotic results
on the total time on test statistic from our general result. To simplify the
discussion suppose actually $N^n(\tau) \ge a_n$ almost surely for each $n$. We have
$N^n(T^n) = a_n$ almost surely. By the properties of a compensator of a counting
process, we have

$$E\big((N^n(T^n) - cA_0^n(T^n))^2\big) = E\big(cA_0^n(T^n)\big) = E\big(N^n(T^n)\big) = a_n.$$

Thus $E\big((A_0^n(T^n)/a_n - 1/c)^2\big) = 1/(a_n c^2) \to 0$ as $n \to \infty$ and our condition
$A_0^n(T^n)/a_n \to_P \alpha = c^{-1}$ is satisfied.

More generally, one can check that if $T^n$ is a stopping time for each $n$ and
$a_n \to \infty$ satisfies both $N^n(T^n)/a_n \to_P 1$, and

$$E\big(N^n(T^n)\big)/(a_n)^2 \to 0 \quad \text{as } n \to \infty,$$

then

$$A_0^n(T^n)/a_n \to_P c^{-1} \quad \text{as } n \to \infty.$$


For a second example, consider the classical random censorship model
$(\tilde X_i, \delta_i) = (\min(X_i, C_i), I\{X_i \le C_i\})$, $i = 1, \ldots, n$, where $X_1, \ldots, X_n$ and $C_1, \ldots, C_n$
are all independent and nonnegative, the $X_i$'s with absolutely continuous
distribution function $F$ and the $C_i$'s with distribution function $G$. Define
$N^n(t) = \#\{i: \tilde X_i \le t, \delta_i = 1\}$ and $Y^n(t) = \#\{i: \tilde X_i \ge t\}$. Then $N^n$ is a counting
process with intensity $\Lambda^n = Y^n \cdot \lambda$ where $\lambda$ is the hazard rate of $F$. Taking
$A_0^n(t) = \int_0^t Y^n(s)\,ds$, we obtain a plot and a test statistic for testing exponentiality
of $F$ versus alternatives of an increasing or decreasing hazard rate.

If we use all the observations, i.e., take $T^n = \infty$, we see that

$$A_0^n(T^n)/n \to_P \int_0^\infty (1 - F(s))(1 - G(s))\,ds = E(\tilde X_1) \le E(X_1) < \infty$$

when $\lambda$ is constant, so our conditions are satisfied with $a_n = n$.
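
The limit above rests on a simple identity: the at-risk process is piecewise constant, so its integral over $[0, \infty)$ equals the sum of the observed times, and $A_0^n(\infty)/n$ is the sample mean of the observed times. The sketch below checks this on toy data; the function names and the data are illustrative assumptions.

```python
def at_risk(times, t):
    """Y^n(t): number of observed times that are >= t."""
    return sum(1 for x in times if x >= t)

def integrated_at_risk(times):
    """Integral of Y^n(s) ds over [0, infinity), summed between ordered times."""
    total, prev = 0.0, 0.0
    for x in sorted(times):
        total += at_risk(times, x) * (x - prev)   # Y^n is constant on (prev, x]
        prev = x
    return total

observed = [0.5, 1.0, 2.5, 4.0]                   # toy observed times min(X_i, C_i)
a0_total = integrated_at_risk(observed)           # A_0^n at T^n = infinity
ratio = a0_total / len(observed)                  # A_0^n(T^n)/n
```

Here `a0_total` equals the sum of the observed times, so `ratio` is their average, which converges to the expected observed time by the law of large numbers.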

This example brings up the question as to whether the total time on test plot
described here is the appropriate generalization from the uncensored to the
censored case. As $n \to \infty$ the plot converges to the curve obtained by plotting

$$\frac{\int_0^t (1 - G(s-))\,dF(s)}{\int_0^\infty (1 - G(s-))\,dF(s)}$$

against

$$\frac{\int_0^t (1 - G(s-))(1 - F(s))\,ds}{\int_0^\infty (1 - G(s-))(1 - F(s))\,ds}, \quad t \in [0, \infty),$$

which depends heavily on the censoring distribution $G$ (though to be sure it is
convex or concave according to whether $\lambda$ is decreasing or increasing). If one is
really more interested in estimating the curve obtained when $G \equiv 0$, then one
would do this by replacing $F$ by the product-limit estimator and deleting $G$.
Unfortunately the asymptotic distribution theory becomes rather more complicated
then (see Gill, 1983).


REMARKS. One might hope that $N^n(T^n) \to_P \infty$ or $A_0^n(T^n) \to_P \infty$ would be
sufficient to prove our weak convergence result. However, this is easily seen not
to be the case. Suppose for instance $N^n$ is a standard Poisson process, $A_0^n = I$,
$c = 1$, and let $T^n$ be the stopping time, the first time after the $n$th event that the
cumulative total time on test statistic takes a positive value. One can show that
$T^n < \infty$ with probability one for each $n$. Obviously we cannot now have weak
convergence to a Brownian bridge.

Two-sample versions of the total time on test plot and statistic are introduced 
by Gill and Schumacher (1985). Other recent related work has been done by Arjas 
and Haara (1985) and Arjas (1985). 


APPENDIX 
A lemma on random time change. 


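LEMMA. Suppose $X^n \to_d X$ in $D[0, \sigma]$ and $T^n \to_P \tau < \sigma$ as $n \to \infty$ where $X$
has continuous sample paths and $T^n \in [0, \sigma]$ for all $n$ almost surely. Then
$Y^n = X^n \circ (T^n \cdot I)$ satisfies $Y^n \to_d X \circ (\tau \cdot I)$ in $D[0,1]$.


PROOF. By a Skorohod-Dudley construction (cf. Vervaat (1972) for a
statement and nice application of this) and continuity of the paths of $X$, we may
suppose that we have, on a single sample space,

$$\|X^n - X\|_\sigma \to 0 \quad \text{a.s.}, \qquad T^n \to \tau \quad \text{a.s.},$$

where $\|\cdot\|_\sigma$ is the supremum norm on $[0, \sigma]$. Immediately we have

$$\|X^n \circ (T^n \cdot I) - X \circ (T^n \cdot I)\|_1 \to 0 \quad \text{a.s.}$$

We must check

$$\|X \circ (T^n \cdot I) - X \circ (\tau \cdot I)\|_1 \to 0 \quad \text{a.s.}$$

But

$$\|X \circ (T^n \cdot I) - X \circ (\tau \cdot I)\|_1 = \sup_{x \in [0,1]} |X(T^n x) - X(\tau x)| \le \sup_{u, v:\, |u - v| \le |T^n - \tau|} |X(u) - X(v)| \to 0 \quad \text{a.s. as } n \to \infty,$$

since the paths of $X$ are continuous. We now have

$$\|X^n \circ (T^n \cdot I) - X \circ (\tau \cdot I)\|_1 \to 0 \quad \text{a.s.},$$

which implies

$$X^n \circ (T^n \cdot I) \to_d X \circ (\tau \cdot I)$$

in $D[0,1]$. $\square$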


NOTE. The lemma also holds when the closed interval $[0, \sigma]$ is replaced by
the semiopen interval $[0, \sigma)$, $0 < \sigma \le \infty$. Just run through the above proof
replacing $\sigma$ by $\sigma'$, $\tau < \sigma' < \sigma$.


REFERENCES 


AALEN, O. O. and HoeEM, J. M., (1978). Random time changes for multivariate counting processes. 
Scand. Actuarial J. 81-101. 

ARJAS, E. (1985). On graphical methods for assessing goodness of fit in Cox's proportional hazards
model. Preprint, Univ. of Oulu.

ARJAS, E. and HAARA, P. (1985). A note on the exponentiality of total hazards before failure.
Preprint, Univ. of Oulu.

BARLOW, R. E. and PROSCHAN, F. (1969). A note on tests for monotone failure rate based on
incomplete data. Ann. Math. Statist. 40 595-600.

GILL, R. D. (1983). Large sample behaviour of the product-limit estimator on the whole line. Ann.
Statist. 11 49-58.

GILL, R. D. and SCHUMACHER, M. (1985). A simple test of the proportional hazards assumption. 
Report MS-R8504, Centre for Mathematics and Computer Science, Amsterdam. 

VERVAAT, W. (1972). Functional central limit theorems for processes with positive drift and their 
inverses. Z. Wahrsch. verw. Gebiete 23 245-263. 


CENTRUM VOOR WISKUNDE EN INFORMATICA 
POSTBUS 4079

1009 AB AMSTERDAM 

THE NETHERLANDS 


The Annals of Statistics
1986, Vol. 14, No. 3, 1240-1245


LOCAL CONVERGENCE OF EMPIRICAL MEASURES IN THE 
RANDOM CENSORSHIP SITUATION WITH APPLICATION TO 
DENSITY AND RATE ESTIMATORS 


By HELMUT SCHÄFER 
University of Heidelberg 


In this paper, we study the local deviations of the empirical measure
defined by the Kaplan-Meier (1958) estimator for the survival function. The
results are applied to derive best rates of convergence for kernel estimators for
the density and hazard rate function in the random censorship model.


1. Introduction. In the random censorship model, instead of the random
variables $T_i$ of interest, one observes variables $X_i = \min(T_i, C_i)$ and indicators
$\delta_i = I(T_i \le C_i)$, $i \in \mathbb{N}$. The $T_i$ are i.i.d. and nonnegative, and so are the censoring
variables $C_i$. $T_i$ and $C_i$ are assumed to be independent for all $i$.

In this situation, the product-limit estimator $F_n$ introduced by Kaplan and
Meier (1958) is widely used to estimate the distribution (survival) function
$F(x) = P(T > x)$ from the observations. Földes and Rejtö (1981) proved strong
uniform convergence of this estimator with rate of $O(\sqrt{\log(n)/n})$. In many
applications, however, the convergence of the empirical measure $dF_n$ is only
needed locally, i.e., on intervals $I_n \subset \mathbb{R}$ with probability mass $p_n$ tending to 0 as
the sample size $n$ increases. Exploiting the faster decrease of the variance of
$\int_{I_n} dF_n$, it is possible to derive smaller rates of convergence on such sets of
intervals. For the empirical probability measure defined by i.i.d. random variables,
Stute (1982a) proves $\sup_{\int_I dF \le p_n} |\int_I dF_n - \int_I dF| = O(\sqrt{p_n \log(n)/n})$ a.s.
under appropriate conditions on the sequence $p_n \to 0$, where the sup is taken over
all intervals $I \subset \mathbb{R}$ with probability mass $\le p_n$.

In the present paper, we show that this result remains true for the
Kaplan-Meier estimator $F_n$ in the random censorship model. In other words, we
study the local oscillation behaviour of a certain empirical process $F_n(t) - F(t)$,
$t \in \mathbb{R}$, having dependent increments. The estimator $H_n$ for the cumulative
hazard function $H(x) = \int_0^x (dF(t))/(F(t))$, introduced by Nelson (1969), may be
treated the same way. Indeed, the Kaplan-Meier estimator is even somewhat
clumsier.

In Section 3, these results are applied to kernel estimators for the density
function $f$ and the hazard rate $h$ of $T$. In a general form, these estimators may
be written as

$$f_n(x) = \int R_n^{-1}(t)\, K\big((x - t)/R_n(t)\big)\, dF_n(t)$$

(for density estimation) with a kernel function $K$ integrating to 1 and a random


Received February 1985; revised November 1985. 

AMS 1980 subject classifications. 62G05, 62P10. 

Key words and phrases. Random censorship model, empirical measures, convergence rates, kernel 
density estimation, sample-point-dependent bandwidths. 




process $R_n$. For deterministic fixed bandwidths $R_n(t) = d_n \to 0$, asymptotic
properties of hazard rate estimators were studied by Ramlau-Hansen (1983),
Tanner and Wong (1983), and Yandell (1983). Tanner (1983) showed pointwise
consistency for random bandwidths $R_n$, depending on the point of interest $x$.
Schäfer (1985) proved strong uniform consistency for an estimator with bandwidths
depending on the sample point

$$R_n(t) = \inf\{r > 0 \mid F_n(t - r/2) - F_n(t + r/2) \ge p_n\}$$

($p_n \to 0$ again a sequence of positive real numbers), modelled upon the variable
kernel estimator of Breiman, Meisel, and Purcell (1977) and Victor (1976).

It is well known that kernel estimators (with fixed bandwidths) converge
uniformly with a rate of $O(\sqrt{\log(n)/(np_n)} + p_n)$ in the uncensored case (Silverman
(1978); Stute (1982b)). We show that the same rate holds for randomly
censored data, taking the example of the aforementioned data-adaptive estimator,
which has found little attention in the literature. We remain in the context of
density estimation. For hazard rates, simply replace the Kaplan-Meier estimator
$F_n$ by the Nelson estimator $H_n$, and apply the corresponding convergence result.


2. Notation and assumptions. We make the usual general assumptions:
The distribution (survival) functions $F(x) = P(T > x)$ of $T$ and $G(x) = P(C > x)$
of $C$ and the subsurvival function $\tilde F(x) = P(X > x \text{ and } \delta = 1)$ are continuous.
The results are restricted to an interval $[0, B]$ with $B$ such that
$0 < F(B)G(B) < 1$. $T$ is supposed to have a density function $f$ which is positive
and bounded on $[0, B]$ (i.e., $0 < m \le f \le M$) and satisfies a Lipschitz condition
$|f(x) - f(y)| \le L_f |x - y|$ for $x, y \in [0, B]$.

For density estimation, the kernel function $K$ is assumed to have compact
support on $[-\tfrac{1}{2}, \tfrac{1}{2}]$, to integrate to 1, and to satisfy also a Lipschitz condition
with constant denoted by $L_K$. (This might be replaced by monotonicity conditions,
for example.) The sequence $(p_n)$ must be chosen such that $p_n \to 0$ and
$np_n/\log(n) \to \infty$.

In the absence of ties, the Kaplan-Meier estimator is well defined by

$$F_n(x) = \prod_{X_i \le x} \big((N_n(X_i) - 1)/N_n(X_i)\big)^{\delta_i}$$

with $N_n(x) = \sum_{i=1}^{n} I(X_i \ge x)$ the number of individuals at risk at time $x - 0$. We
denote by $G_n$ the analogous estimator for the censoring curve $G$, obtained by
substituting $1 - \delta_i$ for $\delta_i$. $\tilde F_n$ denotes the canonical estimator for $\tilde F$ defined by
$\tilde F_n(x) = (1/n) \sum_{i=1}^{n} \delta_i I(X_i \ge x)$.

Finally, by abuse of notation, we write $F(A) = \int_A dF$ for the measure defined
by any distribution function $F$.


3. Local convergence of $F_n$. Clearly, representing $F_n$ as a sum of random
variables suitable for approximation by i.i.d. ones is basic to the aim of investigating
local properties of $F_n$. An easy transformation of $F_n(X_i - 0) - F_n(X_i)$ for
an uncensored observation $X_i$ using the above definition of $F_n$ yields the
representation

$$dF_n = G_n^{-1}\, d\tilde F_n$$

which is fundamental to our procedure. We will approximate $dF_n$ by

$$dF_n^* = G^{-1}\, d\tilde F_n$$

using the result by Földes and Rejtö (1981) cited in the introduction.
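
The representation $dF_n = G_n^{-1}\, d\tilde F_n$ can be checked numerically: at an uncensored observation the product-limit estimator jumps by $1/(n\, G_n(X_i))$, while $\tilde F_n$ jumps by $1/n$. The sketch below assumes no ties; the function name and the toy data are illustrative.

```python
def product_limit(x, delta):
    """Product-limit survival estimate just after each observation (no ties assumed)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    surv, curve = 1.0, {}
    for rank, i in enumerate(order):
        at_risk = n - rank                      # N_n(X_i) when there are no ties
        if delta[i] == 1:
            surv *= (at_risk - 1) / at_risk
        curve[x[i]] = surv
    return curve

x = [1.0, 2.0, 3.0, 4.0, 5.0]                   # toy observations X_i
delta = [1, 0, 1, 1, 0]                         # 1 = uncensored, 0 = censored
n = len(x)
F_n = product_limit(x, delta)                   # Kaplan-Meier estimate of F
G_n = product_limit(x, [1 - d for d in delta])  # same estimator for the censoring curve

checks, prev = [], 1.0
for t in sorted(x):
    jump = prev - F_n[t]                        # mass that dF_n puts at t
    if jump > 0:                                # only uncensored points carry mass
        checks.append((jump, 1.0 / (n * G_n[t])))
    prev = F_n[t]
```

Each pair in `checks` holds the jump of $F_n$ and the value $1/(n\, G_n(X_i))$ at one uncensored point; the two entries agree up to rounding.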
THEOREM 3.1. For $0 \le p \le 1$ and $0 < \varepsilon < 1$,

$$P\Big(\sup_{F(I) \le p} |F_n^*(I) - F(I)| > \varepsilon\Big) \le C\varepsilon^{-2} \exp\big(-Cn\varepsilon^2 (p + \varepsilon)^{-1}\big),$$

with constants $C$ only depending on $F$ and $G$. The sup is taken over all intervals
$I \subset [0, B]$ with mass less than or equal to $p$.


PROOF. For a fixed single $I \subset [0, B]$, $F_n^*(I) - F(I)$ is the mean $\bar Y_n$ of $n$ i.i.d.
random variables distributed as $Y = G^{-1}(T) I(T \le C) I(T \in I) - F(I)$. The
calculation of the expectation and the variance is straightforward (use the
independence of $T$ and $C$ and $\int_{c \ge t} G^{-1}(t)\, dG(c) = 1$):

$$E(Y) = 0, \qquad \sigma^2 = \operatorname{var}(Y) = \int_I G^{-1}(t)\, dF(t) - F(I)^2 \le bF(I), \qquad |Y| \le b,$$

where $b := G^{-1}(B)$. In this situation, it is standard to use the exponential bound

$$P(|F_n^*(I) - F(I)| > \varepsilon) \le 2\exp\big(-n\varepsilon^2/2(\sigma^2 + b\varepsilon)\big)$$

derived from the inequality of Bernstein (1924) as cited by Bennett (1962).
Now consider the finite system of compact intervals

$$S = \{[F^{-1}(i\varepsilon), F^{-1}(j\varepsilon)] \mid i = 0, \ldots, \varepsilon^{-1} + 1;\ j = i, \ldots, i + p\varepsilon^{-1} + 3\}.$$

(Take the integer part of $\varepsilon^{-1}$, $p\varepsilon^{-1}$ and always $F^{-1}(i\varepsilon) \le F^{-1}(j\varepsilon) \le B$.)

Since $F(I) \le p + 3\varepsilon$ for every $I \in S$, and so $\sigma^2 \le bp + 3b\varepsilon$, and since
$\operatorname{card}(S) \le \text{const}/\varepsilon^2$, we obtain the desired exponential bound for
$P(\sup_{I \in S} |F_n^*(I) - F(I)| > \varepsilon)$. It remains to remark that the inverse of this last
inequality implies $|F_n^*(I) - F(I)| \le 3\varepsilon$ for all intervals $I$ with $F(I) \le p$: Indeed,
for such $I$ there exist $I_1, I_2 \in S$ with $I_1 \subset I \subset I_2$ and $F(I_2) - 2\varepsilon \le F(I) \le
F(I_1) + 2\varepsilon$. $\square$


COROLLARY 3.2. Let $p_n > 0$ with $np_n/\log(n) \to \infty$. Then

(1) $\sup_{F(I) \le p_n} |F_n^*(I) - F(I)| = O(\log(n)^{1/2} p_n^{1/2} n^{-1/2})$ a.s.

(2) $\sup_{F(I) \le p_n} |F_n(I) - F_n^*(I)| = O(\log(n)^{1/2} p_n n^{-1/2})$ a.s.

Again, the sup are taken over all intervals with mass $\le p_n$.




PROOF. The condition on $p_n$ implies $p_n + \log(n)^{1/2} p_n^{1/2} n^{-1/2} = O(p_n)$. (1)
then follows from the theorem by Borel-Cantelli. For (2),


$$|F_n(I) - F_n^*(I)| \le \int_I |G_n^{-1} - G^{-1}|\, G\, dF_n^* \le \sup |G_n - G| \; G_n^{-1}(B)\, F_n^*(I).$$


By Theorem 3.2 of Földes and Rejtő (1981), $\sup |G_n - G| = O(\log(n)^{1/2} n^{-1/2})$
and $G_n^{-1}(B) = b + O(\log(n)^{1/2} n^{-1/2})$. By (1),
$F_n^*(I) = p_n + O(\log(n)^{1/2} p_n^{1/2} n^{-1/2}) = O(p_n)$. □

4. Convergence results for the variable kernel estimator. Let $r_n(t) = \inf\{r > 0 \mid F(t - r/2) - F(t + r/2) \ge p_n\}$ be the deterministic analogue of the bandwidths $R_n(t)$ defined in Section 1, and define

$$f_n^1(x) = \int r_n^{-1}(t) K((x - t)/r_n(t))\, dF_n(t)$$
and
$$f_n^2(x) = \int r_n^{-1}(t) K((x - t)/r_n(t))\, dF(t).$$


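THEOREM 4.1. Let $0 < a < b < B$. Under the conditions assumed in Section 2,

(3) $\sup_{a \le x \le b} |f_n(x) - f_n^1(x)| = O(\log(n)^{1/2} n^{-1/2} p_n^{-1/2})$ a.s.,
(4) $\sup_{a \le x \le b} |f_n^1(x) - f_n^2(x)| = O(\log(n)^{1/2} n^{-1/2} p_n^{-1/2})$ a.s.,
(5) $\sup_{a \le x \le b} |f_n^2(x) - f(x)| = O(p_n)$.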


PROOF. The following statements are valid uniformly in $x \in [a, b]$ (constants depend on $f$ and $K$ only) for large $n$, and with probability 1 as far as random variables are involved. The result of Corollary 3.2 is used for $2Mp_n/m$ (see Section 2 for notations) in the place of $p_n$:

(*) $\sup_{|x - y| \le 2Mp_n/m} |F_n[x, y] - F[x, y]| \le C \log(n)^{1/2} p_n^{1/2} n^{-1/2}.$

The sequence on the right-hand side will be denoted by $\varepsilon_n$ in the following. Note $\varepsilon_n/p_n \to 0$ by hypothesis on $p_n$. For $x \in [a, b]$, put $a_n(x) = x - 2p_n/m$, $b_n(x) = x + 2p_n/m$, and $I_n(x) = [a_n(x), b_n(x)]$. Since, for example, $F[x, b_n(x)] \ge 2p_n$ and, by (*), $F_n[x, b_n(x)] > p_n$, we get $|x - t| > r_n(t)/2$ and $|x - t| > R_n(t)/2$ for $t \notin I_n(x)$. Due to $\operatorname{supp}(K) \subset [-\tfrac12, \tfrac12]$, the integrals defining $f_n(x)$, $f_n^1(x)$, and $f_n^2(x)$ may thus be restricted to integration over $I_n(x)$.
(3) (*) implies 
ct , Halt — aC tat) + en/m),t + (r(t) + en/m)] > Pp 


and the corresponding upper bound for the interval defined by subtracting £„/m. 
Hence, by definition of F(t), 8UP; epa, b] Rnt) — TACE) S &,/m. In combination 
with boundedness of 7,,(¢)/p,, and of K, and the Lipschitz condition for K, this 
shows 

ep IEG ~ t)/R,(t)) — ry (t)K((x — £)/7,(¢))| = O(en pa?) 

El (x 
(insert the mixed term r; ({t)K((x — t)/R,(t))). This difference has to be in- 
tegrated over I,(x), so (8) follows by F,(J,(x)) = O(p,) (use (*) again). 

(4) We abbreviate $k_n^x(t) = r_n^{-1}(t) K((x - t)/r_n(t))$ and $\alpha_n^x(t) = F_n[a_n(x), t] - F[a_n(x), t]$ for $t \in I_n(x)$. Obviously, $|r_n(t) - r_n(s)| \le |t - s|$ for all $t, s \in [a, b]$. Combined with $r_n \ge p_n/M$, this implies that the total variation $V_{I_n(x)}(r_n^{-1})$ of $1/r_n$ over $I_n(x)$ is bounded by $4M^2/mp_n$. Repeated application of the formulae $V(uv) \le \sup|u| V(v) + \sup|v| V(u)$ and $V(K \circ u) \le L_K \cdot V(u)$ leads to $V_{I_n(x)}(k_n^x) = O(p_n^{-1})$. Now, by the standard argument using signed measures and integration by parts,

$$|f_n^1(x) - f_n^2(x)| = \Big|\int_{I_n(x)} k_n^x(t)\, d\alpha_n^x(t)\Big| = \Big|-\int_{I_n(x)} \alpha_n^x(t)\, dk_n^x(t)\Big| \le \sup_{t \in I_n(x)} |\alpha_n^x(t)| \; V_{I_n(x)}(k_n^x).$$

The above bound for $V(k_n^x)$ and (*) for the first factor complete the proof.
(5) The Lipschitz condition for f yields the further approximation 


sup |p,/f(x) - r,(t)| = O( pè) 


telz) 
for the bandwidths. Now write 


Hes J POVP) = #))4(#) at 





and 
flx) = I oe (OK (ae Oe — t)) f(t) dt 
and use this approximation together with | f(t) — f(x)| = O(p,) for t € 1,(x).O 


REFERENCES 


AALEN, O. (1978). Nonparametric estimation of partial transition probabilities in multiple decrement
models. Ann. Statist. 6 534-545.

BENNETT, G. (1962). Probability inequalities for the sum of independent random variables. J. Amer.
Statist. Assoc. 57 33-46.

BERNSTEIN, S. (1924). Sur une modification de l'inégalité de Tchebichef. Ann. Sci. Inst. Sav.
Ukraine, Sect. Math. 1.

BREIMAN, L., MEISEL, W. and PURCELL, E. (1977). Variable kernel estimates of multivariate
densities. Technometrics 19 135-144.

FÖLDES, A. and REJTŐ, L. (1981). Strong uniform consistency for nonparametric survival curve
estimators from randomly censored data. Ann. Statist. 9 122-129.

KAPLAN, E. L. and MEIER, P. (1958). Nonparametric estimation from incomplete observations. J.
Amer. Statist. Assoc. 53 457-481.

NELSON, W. (1969). Hazard plotting for incomplete failure data. J. Qual. Tech. 1 27-52.

RAMLAU-HANSEN, H. (1983). Smoothing counting process intensities by means of kernel functions.
Ann. Statist. 11 453-466.

SCHÄFER, H. (1985). A note on data-adaptive kernel estimation of the hazard and density function in
the random censorship situation. Ann. Statist. 13 818-820.

SILVERMAN, B. W. (1978). Weak and strong uniform consistency of the kernel estimate of a density
and its derivatives. Ann. Statist. 6 177-184.

STUTE, W. (1982a). The oscillation behaviour of empirical processes. Ann. Probab. 10 86-107.

STUTE, W. (1982b). A law of the logarithm for kernel density estimators. Ann. Probab. 10 414-422.

TANNER, M. A. (1983). A note on the variable kernel estimator of the hazard function from randomly
censored data. Ann. Statist. 11 994-998.

TANNER, M. A. and WONG, W. H. (1983). The estimation of the hazard function from randomly
censored data by the kernel method. Ann. Statist. 11 989-993.

VICTOR, N. (1976). Non-parametric allocation rules. In Decision Making and Medical Care (De
Dombal and Grémy, eds.) 515-527. North Holland, Amsterdam.

YANDELL, B. S. (1983). Non-parametric inference for rates and densities with censored serial data.
Ann. Statist. 11 1119-1136.


INSTITUT FÜR MEDIZINISCHE
STATISTIK UND DOKUMENTATION 
UNIVERSITÄT HEIDELBERG 

IM NEUENHEIMER FELD 325 
D-6900 HEIDELBERG 

WEST GERMANY 


The Annals of Statistics
1986, Vol. 14, No. 3, 1246-1251


BAHADUR REPRESENTATIONS FOR ROBUST SCALE 
ESTIMATORS BASED ON REGRESSION RESIDUALS 


By A. H. WELSH 
University of Chicago 

We investigate the asymptotic behaviour of the median deviation and the 
semi-interquartile range based on the residuals from a linear regression model 
by deriving weak asymptotic representations for the estimators. These repre- 
sentations may be used to obtain a variety of central limit theorems and yield 
conditions under which the median deviation and the semi-interquartile range 
are asymptotically equivalent. The results justify the use of the estimators as 
concomitant scale estimators in the general scale equivariant M-estimation
of a regression parameter problem. Finally, the results contain as a special 
case those obtained by Hall and Welsh (1985) for independent and identically 
distributed random variables. 


1. Introduction. In this paper, we investigate the asymptotic properties of 
two popular robust scale estimators, the median absolute deviation from the 
median (sometimes called MAD or, at least since Hampel (1974), the median 
deviation) and the semi-interquartile range, applied to the residuals from a linear 
regression model. An important (but not the only) motivation is the problem of 
concomitant scale estimation in M-estimation.

Suppose that we observe $Y_1, \ldots, Y_n$ where

(1.1) $Y_j = x_j'\theta_0 + e_j, \quad 1 \le j \le n,$

with $\{x_j = (x_{j1}, \ldots, x_{jp})'\}$ a sequence of known $p$-vectors ($p \ge 1$), $\theta_0$ a unique unknown regression parameter to be estimated, and $\{e_j\}$ a sequence of independent and identically distributed random variables with unknown distribution function $F$. Relles (1968) and Huber (1973) investigated the class of M-estimators of the regression parameter $\theta_0$ as solutions of equations of the form

$$\sum_{j=1}^{n} x_j \psi(Y_j - x_j'\theta) = 0,$$

where $\psi: R \to R$. In general, scale equivariant M-estimators may be obtained by calculating a location invariant and scale equivariant scale estimator $\sigma_n$ from the data and then solving the system of equations

$$\sum_{j=1}^{n} x_j \psi((Y_j - x_j'\theta)/\sigma_n) = 0.$$

Huber (1964) made three proposals for obtaining a suitable scale estimator $\sigma_n$.


Received April 1985; revised October 1985. 

AMS 1980 subject classifications. Primary 62F35, 60F05; secondary 62G30. 

Key words and phrases. Linear regression, median deviation, quantiles, robust estimation, scale 
estimation, semi-interquartile range. 




The asymptotic theory of the estimators resulting from proposals 1 and 2 may be 
derived from the results of Huber (1967); proposal 3 has proved efficacious for a 
particular M-estimator in the regression problem (see Welsh (1985) for references) 
and has been investigated for M-estimators in the location subproblem by Bell 
(1980). Another conceptually and computationally simple approach is to apply an 
explicit robust scale estimator to the residuals. This procedure is frequently 
adopted in the location subproblem for which the median is a natural initial 
estimator. The small sample performance of the resulting regression M-estima- 
tors has been investigated by Holland and Welsch (1977), Hill and Holland 
(1977), and Denby and Mallows (1977). The asymptotic results of Bickel (1975)
and Yohai and Maronna (1979), which implicitly assume that $F$ is symmetric, are
applicable provided there exists a scale estimator $\sigma_n$ such that $n^{1/2}(\sigma_n - \sigma_0)$ is
bounded in probability for some $\sigma_0 > 0$. However, the results are incomplete in
that no robust scale estimator satisfying the above requirement has been 
exhibited. We will show in this paper that under very mild conditions, the 
median deviation and the semi-interquartile range are appropriately bounded in 
probability. 

Our main result (Theorem 2) yields a central limit theorem for the median 
deviation in the general asymmetric case and establishes that (Corollary 2.1) 
under mild conditions implied by symmetry, the semi-interquartile range is 
asymptotically equivalent to the median deviation. 
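
For concreteness, the two scale estimators discussed above can be computed from regression residuals as in the following minimal sketch. The helper names and the order-statistic quantile convention (the smallest order statistic at which the empirical distribution function reaches $q$) are assumptions made for this illustration; the paper itself only requires a sample quantile satisfying its defining equation.

```python
import math


def residuals(rows, ys, theta):
    """Residuals e_j(theta) = Y_j - x_j' theta for design rows x_j."""
    return [y - sum(x * t for x, t in zip(row, theta)) for row, y in zip(rows, ys)]


def sample_quantile(values, q):
    """Smallest order statistic whose empirical cdf value is at least q."""
    ordered = sorted(values)
    k = max(1, math.ceil(q * len(ordered)))
    return ordered[k - 1]


def median_deviation(res):
    """Median of the absolute deviations of the residuals from their median."""
    med = sample_quantile(res, 0.5)
    return sample_quantile([abs(r - med) for r in res], 0.5)


def semi_interquartile_range(res):
    """Half the distance between the upper and lower sample quartiles."""
    return (sample_quantile(res, 0.75) - sample_quantile(res, 0.25)) / 2.0


if __name__ == "__main__":
    # Hypothetical residuals containing one gross outlier.
    res = [-3, -1, 0, 1, 3, 100, -2, 2]
    print(median_deviation(res), semi_interquartile_range(res))
```

Both estimators depend on the residuals only through order statistics, which is why the single large residual in the usage example has no effect on either value.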


2. Results. We assume throughout that the model (1.1) holds so that, for any $\theta \in R^p$, the residuals from $\theta$ are

$$e_j(\theta) = Y_j - x_j'\theta = e_j - x_j'(\theta - \theta_0), \quad 1 \le j \le n,$$

almost surely, and $e_j(\theta_0) = e_j$, $1 \le j \le n$, almost surely. Under mild smoothness conditions on $F$, the results below hold for any initial regression parameter estimator $\theta_n$ such that

(Ci) $n^{1/2}(\theta_n - \theta_0)$ is bounded in probability,

provided that

(Cii) $x_{j1} = 1$, $1 \le j \le n$, and there exists a positive definite matrix $A$ such that $\lim_{n \to \infty} n^{-1} \sum_{j=1}^{n} x_j x_j' = A$

holds. Condition (Ci) holds for a large class of estimators including M-estimators. It is convenient but not necessary to assume that $\theta_n$ estimates an intercept; with $z_j = (1, x_j')'$, $1 \le j \le n$, instead of $x_j$, $1 \le j \le n$, in condition (Cii), the results below still hold. For examples of estimators which do not estimate an additive main effect, see Jaeckel (1972) and Welsh (1985).

We derive weak Bahadur representations for the median deviation and the semi-interquartile range through the sample quantiles. For $\theta \in R^p$ in the neighborhood of $\theta_0$, let

$$F_n(x, \theta) = n^{-1} \sum_{j=1}^{n} I\{e_j(\theta) \le x\}, \quad x \in R,$$

where $I$ is the indicator function. For simplicity, set $F_n(x, \theta_0) = F_n(x)$. The $q$th sample quantile $\xi_{n,q}(\theta)$ is defined by

$$F_n(\xi_{n,q}(\theta), \theta) = q,$$

and the population quantile $\xi_q$ is defined by

$$F(\xi_q) = q, \quad 0 < q < 1.$$

The first result which is of independent interest, slightly generalises Lemma 1 
of Ruppert and Carroll (1980) by weakening their first condition. The result may 
be proved by an extension of the argument of Ghosh (1971) with Lemma 4.1 of 
Bickel (1975) or using results of Pierce and Kopecky (1979) or Loynes (1980) or by 
modifying the argument of Ruppert and Carroll (1980). The proof is omitted. 


THEOREM 1. Suppose that conditions C hold and the derivative of $F$ exists in a neighborhood of $\xi_q$, is continuous at $\xi_q$, and $F'(\xi_q) > 0$. Then for fixed $q$, $0 < q < 1$,

$$n^{1/2}\{\xi_{n,q}(\theta_n) - \xi_q\} + n^{1/2}\{F_n(\xi_q) - q\}/F'(\xi_q) + n^{1/2}\bar{x}'(\theta_n - \theta_0) \to_p 0.$$


In the special case that $\theta_0$ is known, conditions C are redundant and the
smoothness condition can be weakened to yield the result of Ghosh (1971). The
representation in Theorem 1 is useful in determining the joint asymptotic 
distribution of any finite number of fixed quantiles, possibly in conjunction with 
other statistics. In particular, we immediately obtain a Bahadur representation 
for the semi-interquartile range. 
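
The special case with known regression parameter (i.i.d. errors) can be checked numerically. The sketch below, under the assumptions of Uniform(0, 1) errors, $q = 1/2$ and an order-statistic definition of the sample quantile (all choices made for this illustration, not taken from the paper), computes the sum of the two normalized terms of the representation; that sum should be small even though each term separately is of order one.

```python
import math
import random


def bahadur_remainder(sample, q, xi_q, density_at_xi):
    """sqrt(n)*(sample quantile - xi_q) + sqrt(n)*(F_n(xi_q) - q)/F'(xi_q)."""
    n = len(sample)
    ordered = sorted(sample)
    # Order-statistic convention for the sample quantile (an assumption).
    xi_hat = ordered[max(1, math.ceil(q * n)) - 1]
    ecdf_at_xi = sum(1 for x in sample if x <= xi_q) / n
    root_n = math.sqrt(n)
    return root_n * (xi_hat - xi_q) + root_n * (ecdf_at_xi - q) / density_at_xi


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    for n in (1000, 20000):
        sample = [rng.random() for _ in range(n)]
        # Uniform(0, 1): median 0.5 and density 1 at the median.
        print(n, bahadur_remainder(sample, 0.5, 0.5, 1.0))
```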


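COROLLARY 1.1. Suppose that conditions C hold and that the smoothness condition of Theorem 1 holds for $q = 3/4$ and $q = 1/4$. Let $Q_n(\theta_n) = \{\xi_{n,3/4}(\theta_n) - \xi_{n,1/4}(\theta_n)\}/2$ and $Q = \{\xi_{3/4} - \xi_{1/4}\}/2$. Then

$$n^{1/2}\{Q_n(\theta_n) - Q\} + n^{1/2}\left[\frac{F_n(\xi_{3/4}) - 3/4}{2F'(\xi_{3/4})} - \frac{F_n(\xi_{1/4}) - 1/4}{2F'(\xi_{1/4})}\right] \to_p 0.$$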


The next result, the Bahadur representation for the median deviation, is the main result of this paper. The sample median deviation $S_n(\theta_n, \xi_{n,1/2}(\theta_n))$ is the median of $|e_j(\theta_n) - \xi_{n,1/2}(\theta_n)|$, $1 \le j \le n$, while the population median deviation $s_0 > 0$ is defined by $F(\xi_{1/2} + s_0) - F(\xi_{1/2} - s_0 -) = \tfrac12$.


THEOREM 2. Suppose that conditions C hold and

(i) $F'(\xi_{1/2} + x)$ exists for $x$ in a neighborhood of the origin and is continuous and positive at $x = 0$;
(ii) $F'(\xi_{1/2} \pm s_0 + x)$ exists for $x$ in a neighborhood of the origin and is continuous at $x = 0$;
(iii) $F'(\xi_{1/2} + x) + F'(\xi_{1/2} - x) > 0$ for $x$ in a neighborhood of $s_0$.

Then

$$n^{1/2}\{S_n(\theta_n, \xi_{n,1/2}(\theta_n)) - s_0\} + n^{1/2}\{F_n(\xi_{1/2} + s_0) - F_n(\xi_{1/2} - s_0 -) - \tfrac12\}/g_1 - n^{1/2}\{F_n(\xi_{1/2}) - \tfrac12\}\{g_2/g_1 F'(\xi_{1/2})\} \to_p 0,$$

where $g_1 = F'(\xi_{1/2} + s_0) + F'(\xi_{1/2} - s_0)$ and $g_2 = F'(\xi_{1/2} + s_0) - F'(\xi_{1/2} - s_0)$.


OUTLINE OF PROOF. Let $\tau_n = (\theta_{n1} + \xi_{n,1/2}(\theta_n), \theta_{n2}, \ldots, \theta_{np})'$ and $\tau_0 = (\theta_{01} + \xi_{1/2}, \theta_{02}, \ldots, \theta_{0p})'$ so the median deviation is the median of $|Y_j - x_j'\tau_n|$, $1 \le j \le n$, and $S_n(\theta_n, \xi_{n,1/2}(\theta_n)) = S_n(\tau_n)$. For $\tau \in R^p$ in a neighborhood of $\tau_0$, put

$$G_n(z, \tau) = n^{-1} \sum_{j=1}^{n} \{F(x_j'\tau + z) - F(x_j'\tau - z -)\}, \quad z \ge 0.$$

By condition (iii), for $\tau$ in a neighborhood of $\tau_0$, $s_n(\tau)$ defined by $G_n(s_n(\tau), \tau) = \tfrac12$ is unique and $s_n(\tau_0) = s_0$. Now

$$n^{1/2}\{S_n(\tau_n) - s_0\} = n^{1/2}\{S_n(\tau_n) - s_n(\tau_n)\} + n^{1/2}\{s_n(\tau_n) - s_0\}.$$

The first term may be handled by a modified version of the proof of Theorem 3 of Hall and Welsh (1985) with Lemma 4.1 of Bickel (1975), while the second term may be handled by a one-term Taylor series expansion since $s_n(\tau)$ is differentiable at $\tau_0$ and $s_n'(\tau) = -(g_2/g_1)\bar{x}$ at $\tau_0$. □


If (1.1) does not include an intercept, the above argument applies with $\tau_n = (\xi_{n,1/2}(\theta_n), \theta_n')' \in R^{p+1}$, $\tau_0 = (\xi_{1/2}, \theta_0')' \in R^{p+1}$, and $x_j$ replaced by $z_j$, $1 \le j \le n$.

If the regression parameter $\theta_0$ is known, the result generalises Corollary 3.1 of
Hall and Welsh (1985) by discarding their fourth condition. The resulting
Bahadur representation provides an alternative derivation of the central limit 
theorem (Theorem 2) of Hall and Welsh. For the present problem, the condition- 
ing argument used by Hall and Welsh to prove their central limit theorem is of 
limited utility because conditioning on the estimate of the regression parameter 
does not simplify the structure of the problem. 

Combining Corollary 1.1 and Theorem 2 we obtain the following analogue of
Corollary 3.1.1 of Hall and Welsh (1985).


COROLLARY 2.1. Suppose that in addition to the conditions of Theorem 2, $1 - F(\xi_{1/2} + s_0) = F(\xi_{1/2} - s_0)$ and $F'(\xi_{1/2} + s_0) = F'(\xi_{1/2} - s_0)$. Then

$$n^{1/2}\{S_n(\theta_n, \xi_{n,1/2}(\theta_n)) - Q_n(\theta_n)\} \to_p 0.$$


The above conditions hold if F is symmetric, has connected support, and a 
positive, continuous density on its support. 
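
At the population level, the agreement of the two scale functionals under symmetry can be verified directly. The following minimal sketch, assuming a normal error distribution with hypothetical parameters and a simple bisection solver (neither of which comes from the paper), solves the defining equation of the population median deviation and compares it with the population semi-interquartile range.

```python
from statistics import NormalDist


def population_siqr(dist):
    """Q = (xi_{3/4} - xi_{1/4}) / 2 for a continuous distribution."""
    return (dist.inv_cdf(0.75) - dist.inv_cdf(0.25)) / 2.0


def population_median_deviation(dist, iterations=200):
    """Solve F(xi_{1/2} + s) - F(xi_{1/2} - s) = 1/2 for s by bisection."""
    med = dist.inv_cdf(0.5)

    def excess(s):
        return dist.cdf(med + s) - dist.cdf(med - s) - 0.5

    lo, hi = 0.0, 1.0
    while excess(hi) < 0.0:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if excess(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    errors = NormalDist(mu=1.0, sigma=2.0)  # hypothetical symmetric error law
    print(population_siqr(errors), population_median_deviation(errors))
```

For an asymmetric $F$ the two functionals generally differ, so the symmetry-type conditions in the corollary are what make the comparison meaningful.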




If we include an intercept in (1.1) and assume that $\xi_{1/2} = 0$ so the underlying error distribution $F$ is centered about the origin, we may consider the alternative scale estimator $R_n(\theta_n)$, the median of $|e_j(\theta_n)|$, $1 \le j \le n$. Note that the estimator $R_n(\theta_n)$ is only location invariant if $\theta_n$ includes an intercept estimator but $Q_n(\theta_n)$ and $S_n(\theta_n, \xi_{n,1/2}(\theta_n))$ are location invariant whether or not $\theta_n$ includes an intercept estimator. Specifically, if $\theta_n = (\alpha_n, \beta_n')'$, $\alpha_n \in R$ an intercept estimator, and $\beta_n \in R^{p-1}$, then $Q_n(\theta_n) = Q_n(\beta_n)$ and $S_n(\theta_n, \xi_{n,1/2}(\theta_n)) = S_n(\beta_n, \xi_{n,1/2}(\beta_n))$; if $\alpha_n$ is the median of $Y_j - x_j'\beta_n$, $1 \le j \le n$, then $R_n(\theta_n) = S_n(\theta_n, \xi_{n,1/2}(\theta_n))$. If the conditions of Theorem 2 hold, then, by the same argument as that used to derive Theorem 2, it follows that


{F, (7) - sh Ga 9 — 1/2} 


n¥{R(6,) — ro} + ni? F'(m) + F-n) 





{F (7) F ( To)} ng 50; 

F(1) + F(-1) 

where $r_0 > 0$ is defined by $F(r_0) - F(-r_0) = \tfrac12$. Moreover, it then follows that if in addition $F$ is symmetric about the origin, $S_n(\theta_n, \xi_{n,1/2}(\theta_n))$, $R_n(\theta_n)$ and $Q_n(\theta_n)$ are all asymptotically equivalent.

Finally, many regression parameter estimators (including least squares, least absolute deviations, and other M-estimators) admit a representation of the form

$$n^{1/2}(\theta_n - \theta_0) - n^{-1/2} \sum_{j=1}^{n} A^{-1} x_j \psi(e_j) \to_p 0,$$

for some real function $\psi$. It is straightforward to use such a representation and the results of the present paper to obtain central limit theorems for $\xi_{n,q}(\theta_n)$, $0 < q < 1$, $(\theta_n', S_n(\theta_n, \xi_{n,1/2}(\theta_n)))'$, $(\xi_{n,1/2}(\theta_n), S_n(\theta_n, \xi_{n,1/2}(\theta_n)))'$, etc.


Acknowledgments. I am grateful to David Pollard and Stephen Stigler for 
helpful conversations and to a referee and Associate Editor for comments which 
improved the presentation of this paper. 


REFERENCES 


BAHADUR, R. R. (1966). A note on quantiles in large samples. Ann. Math. Statist. 37 577-580.

BELL, R. M. (1980). An adaptive choice of the scale parameter for M-estimators. Technical Report,
Stanford Univ.

BICKEL, P. J. (1975). One-step Huber estimates in the linear model. J. Amer. Statist. Assoc. 70
428-434.

DENBY, L. and MALLOWS, C. L. (1977). Two diagnostic displays for robust regression analysis.
Technometrics 19 1-13.

GHOSH, J. K. (1971). A new proof of the Bahadur representation of quantiles and an application.
Ann. Math. Statist. 42 1957-1961.

HALL, P. and WELSH, A. H. (1985). Limit theorems for the median deviation. Ann. Inst. Statist.
Math. 37 27-36.

HAMPEL, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc.
69 383-397.

HILL, R. W. and HOLLAND, P. W. (1977). Two robust alternatives to least squares regression. J.
Amer. Statist. Assoc. 72 828-833.

HOLLAND, P. W. and WELSCH, R. E. (1977). Robust regression using iteratively reweighted least
squares. Commun. Statist. A 6 813-827.

HUBER, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.

HUBER, P. J. (1967). The behaviour of maximum likelihood estimates under non-standard conditions.
In Proc. Fifth Berkeley Symp. Math. Statist. Prob. 1 221-233. Univ. California Press.

HUBER, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Statist. 1
799-821.

JAECKEL, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the
residuals. Ann. Math. Statist. 43 1449-1458.

LOYNES, R. M. (1980). The empirical distribution function of residuals from generalized regression.
Ann. Statist. 8 285-298.

PIERCE, D. A. and KOPECKY, K. J. (1979). Testing goodness of fit for the distribution of errors in
regression models. Biometrika 66 1-5.

RELLES, D. (1968). Robust regression by modified least squares. Ph.D. thesis, Yale Univ.

RUPPERT, D. and CARROLL, R. J. (1980). Trimmed least squares estimation in the linear model. J.
Amer. Statist. Assoc. 75 828-838.

WELSH, A. H. (1985). An angular approach for linear data. Biometrika 72 441-450.

YOHAI, V. J. and MARONNA, R. A. (1979). Asymptotic behaviour of M-estimators for the linear
model. Ann. Statist. 7 258-268.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF CHICAGO 

5734 SOUTH UNIVERSITY AVENUE 
CHICAGO, ILLINOIS 60637 


The Annals of Statistics 
1986, Vol 14, No. 3, 1252-1256 


ON A CONVERSE TO SCHEFFÉ'S THEOREM


By T. J. SWEETING 


University of Surrey 

In Boos (1985) equicontinuity conditions are given which ensure the 
uniform convergence of densities in $R^k$, given convergence in distribution. In
the present note we show that such equicontinuity conditions in fact char- 
acterize uniform local convergence with no additional assumptions on the 
sequence of densities, or on the limit density. Versions of these results are also 
given when the distributions depend on an unknown parameter; these forms 
will be relevant for the uniform approximation of likelihood functions. 


1. Introduction. In Boos (1985) it is shown that convergence in distribution 
entails continuous convergence of the corresponding densities under boundedness 
and equicontinuity assumptions regarding the sequence of densities. This result 
follows from the Ascoli theorem regarding the sequential compactness of families 
of functions. It is further demonstrated in that paper that a strengthening of 
these conditions leads to uniform convergence of the densities over the entire 
Euclidean space. The results are applied to the problem of proving local limit 
theorems for translation and scale statistics. 

The main purpose of the present note is to show that such equicontinuity 
conditions are also necessary for the stated convergences. Additionally, it is not 
actually necessary to assume continuity of the individual densities in these 
results, nor even the existence at the outset of a density for the limit distribution. 
When local limit results are required for the construction of likelihood functions, 
it is also necessary to have uniform convergence over compact subsets of the 
parameter space; this matter is discussed in Section 3. 


2. Uniform convergence of densities. We borrow the notation in Boos (1985): Let $(G_n)$ be a sequence of absolutely continuous (wrt Lebesgue measure $\mu$) distributions on $R^k$, and let $(g_n)$ be a corresponding sequence of densities. We seek necessary and sufficient conditions for the sequence $(g_n)$ to converge uniformly to some density $g$. Since we do not assume here that the $g_n$ are continuous, we require the following slight modification of equicontinuity. As in Boos (1985) we let $|\cdot|$ on $R^k$ be the maximum absolute coordinate. We shall say that a sequence of real-valued functions $(u_n)$ on $R^k$ is asymptotically equicontinuous (a.e.c.) at $x \in R^k$ if given $\varepsilon > 0$, there exist $\delta(x, \varepsilon)$, $n(x, \varepsilon)$ such that whenever $|y - x| < \delta(x, \varepsilon)$ then $|u_n(y) - u_n(x)| < \varepsilon$ for all $n > n(x, \varepsilon)$. The sequence $(u_n)$ is a.e.c. if it is a.e.c. at each point $x$ of $R^k$. Likewise, we say that $(u_n)$ is asymptotically uniformly equicontinuous (a.u.e.c.) if it is a.e.c. and $\delta(x, \varepsilon) = \delta(\varepsilon)$, $n(x, \varepsilon) = n(\varepsilon)$ above. (These definitions actually coincide with the


Received June 1985; revised November 1985. 
AMS 1980 subject classifications. Primary 62E20; secondary 60F99.
Key words and phrases. Equicontinuity, uniform convergence of densities, uniform approximation 
of likelihood functions. 


definitions of equicontinuity and uniform equicontinuity given in Boos in the case of continuous $u_n$; the addition of the word "asymptotic" better describes the concepts, however, especially in the case of discontinuous $u_n$.) Finally, the sequence $(u_n)$ is bounded if $\sup_n |u_n(x)| \le M(x) < \infty$ for each $x \in R^k$. The following theorems clarify the position in Boos (1985). We use the symbol $\Rightarrow$ to stand for weak convergence.


THEOREM 1. The following two statements are equivalent.

(1) $(g_n)$ is a.e.c. and bounded, and $G_n \Rightarrow G$.

(2) $g_n \to g$ pointwise, uniformly in compacts of $R^k$, where $g$ is the continuous density of the distribution $G$.


PROOF. (1) $\Rightarrow$ (2). In the case of continuous densities $(g_n)$ and $g$, the result follows from the Ascoli theorem along with Scheffé's theorem, as given in Boos (1985). However, it may readily be checked that the proof of the Ascoli result (as given in Corollary 31 of Chapter 9 in Royden (1968), for example) goes through for a general a.e.c. sequence $(g_n)$ of (not necessarily continuous) functions with virtually no change; we omit the details. The convergence of any subsequence $(g_{n'})$ to a continuous limit $g$ uniformly in compacts of $R^k$ implies that for any compact set $K \subset R^k$, $G_{n'}(K) \to \int_K g\, d\mu$. It now follows from the regularity of the measure given by $H(A) = \int_A g\, d\mu$ (Rudin (1970), Theorem 2.18) that $H = G$ and hence $g$ is the unique continuous density of $G$. The result now follows from the Ascoli theorem.

(2) $\Rightarrow$ (1). The following form of converse uses the local compactness of $R^k$. Let $x \in R^k$ and $\varepsilon > 0$. Choose $\delta(x, \varepsilon)$, $n(x, \varepsilon)$ such that $|g(y) - g(x)| < \varepsilon$ and $|g_n(y) - g(y)| < \varepsilon$ whenever $|y - x| < \delta(x, \varepsilon)$ and $n > n(x, \varepsilon)$ (from the continuity of $g$ and uniform convergence of $(g_n)$ in compacts). Then if $|y - x| < \delta(x, \varepsilon)$ and $n > n(x, \varepsilon)$ we have

(3) $|g_n(y) - g_n(x)| \le |g_n(y) - g(y)| + |g_n(x) - g(x)| + |g(y) - g(x)| < 3\varepsilon$

and, hence, $(g_n)$ is a.e.c. Also, $(g_n)$ is trivially bounded as $(g_n)$ converges pointwise. The weak convergence $G_n \Rightarrow G$ follows from Scheffé's theorem. □


Thus, apart from allowing arbitrary densities, the sufficient conditions in Boos (1985) for density convergence cannot further be weakened if uniformity in compacts of $R^k$ is demanded. We remark that it is not necessary to assume at the outset the existence of a density for $G$ in the implication (1) $\Rightarrow$ (2). When the sequence $(g_n)$ is continuous, then of course asymptotic equicontinuity is equivalent to equicontinuity. Finally, we note that when $(g_n)$ is continuous the uniform convergence of $(g_n)$ in (2) itself entails the continuity of the limit $g$.

The next result treats uniform convergence over $R^k$.




THEOREM 2. The following two statements are equivalent.

(4) $(g_n)$ is a.u.e.c. and bounded, and $G_n \Rightarrow G$.

(5) $g_n \to g$ pointwise, uniformly in $R^k$, where $g$ is the uniformly continuous density of the distribution $G$.


PROOF. (4) $\Rightarrow$ (5). Since $(g_n)$ is a.e.c. and bounded we know that $g_n \to g$ pointwise from Theorem 1 where $g$ is the continuous density of $G$. But $(g_n)$ a.u.e.c. implies that given $\varepsilon > 0$ there exist $\delta(\varepsilon)$, $n(\varepsilon)$ such that $|g_n(y) - g_n(x)| < \varepsilon$ whenever $|y - x| < \delta(\varepsilon)$ and $n > n(\varepsilon)$. Then if $|y - x| < \delta(\varepsilon)$

$$|g(y) - g(x)| = \lim_{n \to \infty} |g_n(y) - g_n(x)| \le \varepsilon,$$

and hence $g$ is uniformly continuous in $R^k$.

We give a slightly different proof of the uniform convergence of $(g_n)$ to that in Boos (1985). In view of Theorem 1, it suffices to prove that (i) $g_n(x_n) \to 0$ and (ii) $g(x_n) \to 0$ whenever $|x_n| \to \infty$. Consider first (i) and suppose to the contrary that there exists $\varepsilon > 0$ and a sequence $(x_n)$ with $|x_n| \to \infty$ such that $g_{n'}(x_{n'}) > 2\varepsilon$ along some subsequence $(n')$. From asymptotic uniform equicontinuity it follows that there exist $\delta(\varepsilon) > 0$, $n(\varepsilon)$ such that $g_{n'}(y) > \varepsilon$ whenever $|y - x_{n'}| < \delta(\varepsilon)$ and $n' > n(\varepsilon)$. Hence $\int_{|y - x_{n'}| < \delta(\varepsilon)} g_{n'}(y)\, dy > \varepsilon[2\delta(\varepsilon)]^k$ for each $n'$. Thus $G_{n'}(\{y: |y| > |x_{n'}| - \delta(\varepsilon)\}) > \varepsilon[2\delta(\varepsilon)]^k$ and since $|x_{n'}| \to \infty$ the sequence $(G_n)$ is not tight and hence cannot converge weakly. A similar argument using the uniform continuity of $g$ shows that if (ii) fails then $G$ is not tight and hence is not a probability measure.

(5) $\Rightarrow$ (4). We can choose $\delta(\varepsilon)$ and $n(\varepsilon)$ such that $|g(y) - g(x)| < \varepsilon$ whenever $|y - x| < \delta(\varepsilon)$ and $|g_n(y) - g(y)| < \varepsilon$ for all $y$ and $n > n(\varepsilon)$. Then if $|y - x| < \delta(\varepsilon)$ and $n > n(\varepsilon)$, (3) holds and $(g_n)$ is a.u.e.c. as required. The boundedness of $(g_n)$ and weak convergence of $(G_n)$ follow as before. □


Note that the conditions that $g$ be continuous and $g(x_n) \to 0$ as $|x_n| \to \infty$ stipulated by Boos (1985) are redundant, since the asymptotic uniform equicontinuity and boundedness of $(g_n)$ entail the existence and uniform continuity of the limit, which in turn imply that $g(x_n) \to 0$ as $|x_n| \to \infty$.
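
A familiar instance of uniform convergence of densities over the whole line is the convergence of Student $t$ densities to the standard normal density as the degrees of freedom grow. The sketch below approximates the sup-distance on a finite grid; the grid width, the step count and the function names are assumptions made for this illustration, and a finite grid only approximates the supremum over $R$.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def student_t_pdf(x, df):
    """Student t density with df degrees of freedom, computed on the log scale."""
    log_const = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2.0 * math.log1p(x * x / df))


def sup_distance(df, half_width=10.0, steps=2001):
    """Grid approximation to sup_x |t_df density(x) - normal density(x)|."""
    grid = [-half_width + 2.0 * half_width * i / (steps - 1) for i in range(steps)]
    return max(abs(student_t_pdf(x, df) - normal_pdf(x)) for x in grid)


if __name__ == "__main__":
    for df in (5, 50, 500):
        print(df, sup_distance(df))
```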

As remarked by Boos, it is more convenient to consider uniform convergence 
on compacts than pointwise convergence. In the latter case, it is possible to give a 
version of Theorem 1 by simply replacing the a.e.c and boundedness conditions 
on (g,,) in (1) by an appropriate sequential compactness condition. Such a result 
however is of little use, as it seems difficult to characterize compactness in the 
topology of pointwise convergence. 


3. Uniform approximation of likelihood functions. As remarked in Boos (1985), local limit results are especially important in statistics whenever one wishes to construct an approximate likelihood function based on some appropriate statistic. It would also appear to be essential that such an approximation can be made uniformly in compact subsets of the parameter space $\Omega$.

Suppose then that the distributions $G_n(x; \theta)$ depend on an unknown parameter $\theta \in \Omega$, which we assume to be a subset of $R^l$ (although the results below may be stated for an arbitrary separable locally compact metric space). Again, for each $\theta$ we assume $G_n(x; \theta)$ is absolutely continuous with density $g_n(x; \theta)$ (not necessarily continuous in either $x$ or $\theta$). The following version of Theorem 1 follows easily on replacing $x$ by $(x, \theta)$ throughout in the proof of Theorem 1; otherwise the proof requires no change.


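THEOREM 3. The following two statements are equivalent.

(6) $(g_n)$ is a.e.c. and bounded in $R^k \times \Omega$, and $G_n(\,\cdot\,; \theta) \Rightarrow G(\,\cdot\,; \theta)$ for each $\theta \in \Omega$.

(7) $g_n \to g$ pointwise, uniformly in compacts of $R^k \times \Omega$, where $g(\,\cdot\,; \theta)$ is the continuous (in $R^k \times \Omega$) density of $G(\,\cdot\,; \theta)$.

Note that since (6) $\Rightarrow$ (7), the conditions in (6) also ensure that $G_n(\,\cdot\,; \theta) \Rightarrow G(\,\cdot\,; \theta)$ uniformly in compacts of $R^k \times \Omega$, and that $G(\,\cdot\,; \theta)$ is continuous in $R^k \times \Omega$.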

In practice, uniform convergence in $R^k$ will often be necessary for likelihood approximations, as $G_n(x; \theta)$ will usually be the distribution of some quantity $u_n(T_n, \theta)$ where $T_n$ is a statistic. Then the density of $T_n$ is given by

$$f_n(t; \theta) = g_n(u_n(t; \theta); \theta)\, |(\partial/\partial t) u_n(t, \theta)|,$$

where $g_n$ is the density of $u_n(T_n, \theta)$, and as $\theta$ ranges over compacts of $\Omega$, the values of $u_n(t, \theta)$ for given $t$ and all $n$ will usually not be confined to some compact set. It will therefore be necessary to show that $g_n \to g$ uniformly in $R^k \times K$ for every compact $K \subset \Omega$.


THEOREM 4. Let $K$ be any compact subset of $\Omega$. Then the following two statements are equivalent.

(8) $(g_n)$ is a.u.e.c. and bounded in $R^k \times K$ and $G_n(\,\cdot\,; \theta) \Rightarrow G(\,\cdot\,; \theta)$ uniformly in $R^k \times K$ where $G$ is continuous in $\theta$.

(9) $g_n \to g$ pointwise, uniformly in $R^k \times K$, where $g(\,\cdot\,; \theta)$ is the density of $G$, uniformly continuous in $R^k \times K$.


Note that $(g_n)$ a.e.c. in $R^k \times \Omega$ and a.u.e.c. in $R^k$ for each $\theta$ is equivalent to $(g_n)$ a.u.e.c. in $R^k \times K$ for every compact $K \subset \Omega$. (This follows from the separability of $\Omega$.) Theorem 4 follows as a corollary to Theorem 2, given the following basic lemma. Let $X$ and $Y$ be two metric spaces. The sequence $(u_n)$ of functions $u_n: X \to Y$ is said to converge continuously to $u$ if $u_n(x_n) \to u(x)$ for every sequence $(x_n)$ with $x_n \to x$ for all $x \in X$.

LEMMA. Let $(u_n)$ be an arbitrary sequence of functions $u_n: X \to Y$ where $X$ is locally compact. Then $u_n \to u$, a continuous limit, uniformly in compacts of $X$, if and only if $u_n \to u$ continuously.




The lemma follows from standard arguments, and the proof is omitted. Theorem 4 now follows from Theorem 2 on taking convergent sequences $(\theta_n)$ and writing $g_n^*(x) = g_n(x; \theta_n)$. As an example, consider a translation statistic $T_n$ satisfying the conditions of Theorem 1 in Boos (1985). Then if $T_n$ is also a scale statistic, it is easily shown, following the proof in Boos, that (8) holds with $G_n(\,\cdot\,; \theta)$ the distribution of $n^{1/2}(T_n - \mu)$ and $\theta = (\mu, \sigma)$, where $\mu$ is the target parameter of $T_n$, giving the desired uniform density convergence in (9).


REFERENCES 


BOOS, D. D. (1985). A converse to Scheffé's theorem. Ann. Statist. 13 423-427.
ROYDEN, H. L. (1968). Real Analysis. Macmillan, New York.
RUDIN, W. (1970). Real and Complex Analysis. McGraw-Hill, New York.


DEPARTMENT OF MATHEMATICS 
UNIVERSITY OF SURREY 
GUILDFORD, SURREY GU2 6XH 
ENGLAND 




The Annals of Statistics 
1986, Vol. 14, No. 3, 1257-1260 
THE EFFICIENCY OF GOOD’S NONPARAMETRIC 
COVERAGE ESTIMATOR 


By WARREN W. Esty 


Montana State University 


The asymptotic efficiency of Good's nonparametric coverage estimator is 
obtained relative to the best estimator derived under the assumption that all 
classes are equally likely. Even when that assumption is true, Good's estima- 
tor is quite efficient, with an asymptotic relative efficiency of greater than 85% 
in all cases, and greater than 95% if the expected coverage is less than 
one-half. 


1. Introduction. The coverage, $C$, of a random sample of size $n$ from a
multinomial population is defined to be the sum of the probabilities of the
observed classes. Estimating the coverage of a sample from an unknown multinomial
distribution is an occupancy problem with applications in ecology (Good
and Toulmin, 1956; Engen, 1974; Engen, 1978), vocabulary studies (Efron and
Thisted, 1976; McNeil, 1973) and archaeology (American Numismatic Society,
1974; Esty, 1982, 1983). Good (1953) introduced an estimator for $C$,

(1.1) $\hat{C} = 1 - N_1/n$,

where $N_1$ denotes the number of classes observed exactly once, which has received
much attention (Robbins, 1968; Engen, 1978; Starr, 1979; Chao, 1981; and many 
others). Esty (1983) found the associated confidence intervals under very general 
conditions. Numismatists regularly use estimators derived under the hypothesis 
that all classes are equally likely, which is the hypothesis of the classical 
occupancy problem (Feller, 1968; Johnson and Kotz, 1977, Section 6.2.1). Carter 
(1981) compared several of these (Lyon, 1965; Brown 1955/57; Carcassone, 1980; 
and Mora-Mas, 1981; Schroeck, 1981, has given another one since then) by 
evaluating them using real data where the classes are varieties of ancient coins, 
but no theoretical comparison of Good’s estimator to other commonly employed 
estimators has been given. Users of these intervals will ask if the generality of the
nonparametric approach is accompanied by a substantial loss in efficiency relative
to the methods already in use. The answer is “no”; Good’s estimator is
remarkably efficient. When all classes are equally likely, the asymptotic efficiency
of $\hat{C}$ relative to the best estimator based on the hypothesis that all classes are
equally likely exceeds 85% in all cases and exceeds 90% if the expected coverage is 
below 76%. An explicit formula for the asymptotic relative efficiency is given by 
Theorem 3. 
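To make definition (1.1) concrete, the following is a minimal sketch of how Good's estimate could be computed from a list of observed class labels. The function names, the toy sample and the class probabilities are illustrative assumptions, not taken from the paper. The true coverage can only be evaluated here because the toy probabilities are assumed known.

```python
from collections import Counter

def good_coverage_estimate(sample):
    """Good's estimator (1.1): 1 - N_1 / n, with N_1 the number of singleton classes."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    n1 = sum(1 for count in counts.values() if count == 1)
    return 1.0 - n1 / n

def true_coverage(sample, probabilities):
    """Coverage C: total probability of the classes that appear in the sample."""
    return sum(probabilities[label] for label in set(sample))

if __name__ == "__main__":
    # Hypothetical population with six classes (assumed for illustration only).
    probs = {"a": 0.4, "b": 0.3, "c": 0.1, "d": 0.1, "e": 0.05, "f": 0.05}
    sample = list("aabbbcde")   # n = 8, with singletons c, d, e
    print("Good's estimate:", good_coverage_estimate(sample))
    print("true coverage:  ", true_coverage(sample, probs))
```

In the toy sample three of the eight observations are singletons, so the estimate is 1 - 3/8.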


Received January 1985; revised November 1985.
AMS 1980 subject classifications. Primary 62G20; secondary 62F12.
Key words and phrases. Coverage, total probability, occupancy problem, unobserved species.


2. Discussion and results. Note that the coverage of a sample is not a
parameter of the population. Therefore an “estimator” of the coverage is not an
estimator in the usual sense and the “efficiency” of an estimator of $C$ cannot be
defined in the usual sense, but it is possible to define the asymptotic relative
efficiency of two estimators by comparing the variances of their associated
normal limit theorems in the usual manner. Good’s nonparametric estimator will
be compared to the most restrictive parametric estimator, namely, the estimator
based on the equally likely hypothesis, when the equally likely hypothesis is true.

First we need the normal limit law for the estimator based on the equally
likely hypothesis. Suppose $n$ balls are distributed at random into $k$ boxes. Let $D$
denote the number of occupied boxes. The asymptotic behavior of $D$ is well
known as $n \to \infty$ and $k \to \infty$ such that $n/k \to m > 0$ (Geiringer, 1938; or see
Johnson and Kotz, 1977, Chapter 6.1). If $k$ is unknown, $D$ is a sufficient statistic
for $k$ (Darroch, 1958). Asymptotically, the estimator for $k$, $Y$, is given by the
solution of

(2.1) $D = Y(1 - e^{-n/Y})$.

Now $C = D/k$, and the corresponding estimator for $C$ is

(2.2) $\tilde{C} = D/Y$.
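Equation (2.1) has no closed-form solution for $Y$, so in practice it has to be solved numerically. The sketch below uses plain bisection, which is one possible choice and not necessarily what practitioners use. The function names are hypothetical. It relies on the fact that $y(1 - e^{-n/y})$ increases in $y$ from below $D$ (at $y = D$) towards $n$, so a finite root exists only when $0 < D < n$.

```python
import math

def solve_equally_likely_k(D, n, iterations=200):
    """Solve D = Y * (1 - exp(-n / Y)) for Y by bisection.

    Requires 0 < D < n: if every draw hits a new class (D == n) the
    equation has no finite solution.
    """
    if not 0 < D < n:
        raise ValueError("need 0 < D < n for a finite solution")

    def f(y):
        # -expm1(-t) equals 1 - exp(-t) but is more accurate for small t.
        return -y * math.expm1(-n / y) - D

    lo, hi = float(D), 2.0 * D      # f(D) < 0 because 1 - exp(-n/D) < 1
    while f(hi) < 0.0:              # f increases towards n - D > 0
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equally_likely_coverage(D, n):
    """Coverage estimate (2.2): D / Y with Y the solution of (2.1)."""
    return D / solve_equally_likely_k(D, n)

if __name__ == "__main__":
    D, n = 60, 100                  # illustrative values only
    Y = solve_equally_likely_k(D, n)
    print("estimated number of classes Y:", Y)
    print("estimated coverage D / Y:     ", equally_likely_coverage(D, n))
```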


The associated normal limit law is: 


THEOREM 1. If all classes are equally likely and $n \to \infty$ and $k \to \infty$ such
that $n/k \to -\ln(1 - c)$, $0 < c < 1$, then

$n^{1/2}(C - \tilde{C}) \to_D N\left(0, \frac{c^2[-(1 - c)\ln(1 - c)]}{c + (1 - c)\ln(1 - c)}\right)$.


COMMENT. Note that $c$ has been chosen such that $E(C) \to c$ and $C \to_P c$.
The corresponding result for Good’s estimator is: 


THEOREM 2. If all classes are equally likely and $n \to \infty$ and $k \to \infty$ such
that $n/k \to -\ln(1 - c)$, $0 < c < 1$, then

$n^{1/2}(C - \hat{C}) \to_D N\bigl(0, (1 - c)(c - \ln(1 - c))\bigr)$.
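The two limit laws lend themselves to a quick Monte Carlo sanity check. The following is a minimal sketch, not part of the paper. It throws $n$ balls into $k$ equally likely boxes many times and compares the empirical variances of $n^{1/2}(C - \text{estimate})$ with the limiting variances stated in Theorems 1 and 2, for Good's estimator $1 - N_1/n$ and the equally likely estimator $D/Y$. The helper names, the bisection solver and the choices $k = 1000$, $n = 693$ (so that $c$ is close to one-half) are assumptions made for illustration. Agreement is only expected up to simulation noise and finite-sample effects.

```python
import math
import random
from collections import Counter
from statistics import pvariance

def solve_Y(D, n, iterations=100):
    """Bisection solution of D = Y * (1 - exp(-n / Y)); needs 0 < D < n."""
    def f(y):
        return -y * math.expm1(-n / y) - D
    lo, hi = float(D), 2.0 * D
    while f(hi) < 0.0:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate(k, n, reps, seed=0):
    """Empirical variances of sqrt(n) * (C - estimate) for both estimators."""
    rng = random.Random(seed)
    good_errors, equal_errors = [], []
    for _ in range(reps):
        counts = Counter(rng.randrange(k) for _ in range(n))
        D = len(counts)                       # number of occupied boxes
        if D == n:                            # no finite solution of (2.1)
            continue
        N1 = sum(1 for v in counts.values() if v == 1)
        C = D / k                             # true coverage when classes are equally likely
        good_errors.append(math.sqrt(n) * (C - (1.0 - N1 / n)))
        equal_errors.append(math.sqrt(n) * (C - D / solve_Y(D, n)))
    return pvariance(good_errors), pvariance(equal_errors)

def limiting_variances(c):
    """Limiting variances from Theorem 2 (Good) and Theorem 1 (equally likely)."""
    log_term = math.log(1.0 - c)
    good = (1.0 - c) * (c - log_term)
    equal = c * c * (-(1.0 - c) * log_term) / (c + (1.0 - c) * log_term)
    return good, equal

if __name__ == "__main__":
    k, n = 1000, 693
    c = 1.0 - math.exp(-n / k)                # so that n / k = -ln(1 - c)
    print("simulated (Good, equally likely):", simulate(k, n, reps=1500, seed=1))
    print("limiting  (Good, equally likely):", limiting_variances(c))
```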

THEOREM 3. If all classes are equally likely and $n \to \infty$ and $k \to \infty$ such
that $n/k \to -\ln(1 - c)$, $0 < c < 1$, then the asymptotic relative efficiency of
Good’s estimator to the estimator based on the equally likely hypothesis is given
by

$E = \frac{(-\ln(1 - c))\,c^2}{(c + (1 - c)\ln(1 - c))(c - \ln(1 - c))}$.


COROLLARY. (a) As $c \to 0$, $E \to 1$. (b) As $c \to 1$, $E \to 1$. (c) $E > 0.85$ for all
$c$, $0 < c < 1$.


For example, if $c = 0.1$, then $E = 0.9913$ and confidence intervals based on
Good’s estimator are asymptotically only 0.45% longer than those using the
equally likely hypothesis. If $c = 0.5$, then $E = 0.9466$ and intervals are 2.8%
longer. If $c = 0.9$, $E = 0.8736$ and intervals are 7% longer. The minimum value of
$E$, 0.8516, is attained at $c = 0.978$. Thus Good’s estimator is quite efficient.
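The efficiency in Theorem 3 is the ratio of the limiting variance in Theorem 1 to that in Theorem 2, and it is easy to tabulate. The sketch below is an illustrative helper, not code from the paper, and its function names are assumptions. It evaluates $E$ for a given $c$, converts it to the relative increase in interval length $1/\sqrt{E} - 1$ (interval lengths scale with the standard deviation), and locates the minimum on a grid of assumed step 0.001.

```python
import math

def asymptotic_relative_efficiency(c):
    """E from Theorem 3: variance in Theorem 1 divided by variance in Theorem 2."""
    if not 0.0 < c < 1.0:
        raise ValueError("c must lie strictly between 0 and 1")
    log_term = math.log1p(-c)                 # ln(1 - c), accurate for small c
    numerator = -log_term * c * c
    denominator = (c + (1.0 - c) * log_term) * (c - log_term)
    return numerator / denominator

def extra_interval_length(c):
    """Relative increase in asymptotic interval length, 1 / sqrt(E) - 1."""
    return 1.0 / math.sqrt(asymptotic_relative_efficiency(c)) - 1.0

def minimize_on_grid(step=0.001):
    """Return (minimum E, c at the minimum) over the grid step, 2*step, ..., 1 - step."""
    grid = [i * step for i in range(1, int(round(1.0 / step)))]
    return min((asymptotic_relative_efficiency(c), c) for c in grid)

if __name__ == "__main__":
    for c in (0.1, 0.5):
        print(c, asymptotic_relative_efficiency(c), extra_interval_length(c))
    print("grid minimum (E, c):", minimize_on_grid())
```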


3. Proofs. Since $D = k - N_0$, where $N_0$ denotes the number of unoccupied
boxes, $D$ is asymptotically normal with mean and variance asymptotic to

(3.1) $k(1 - e^{-n/k})$ and $k e^{-n/k}\bigl(1 - (1 + n/k)e^{-n/k}\bigr)$,

respectively (Johnson and Kotz, 1977, page 317, number 3, where an m = k is
missing on the right side). Since $C = D/k = 1 - N_0/k$, $E(C) \to c$ and $C \to_P c$.
To prove Theorem 1 note that

(3.2) $n^{1/2}(C - \tilde{C}) = n^{1/2}\left(\frac{D}{k} - \frac{D}{Y}\right) = \frac{D}{Y}\, n^{1/2}\, \frac{Y - k}{k}$.

From (2.1), $D/Y \to_P c$, also. Thus in (3.2) it remains to determine the limit law of
$n^{1/2}(Y - k)/k$. In (2.1), treating $Y$ as a function of $D$,
$Y' = [1 - e^{-n/Y} - (n/Y)e^{-n/Y}]^{-1}$, by implicit differentiation. Let $d = k(1 - e^{-n/k})$,
so that $Y(d) = k$ and $Y'(d) \to [c + (1 - c)\ln(1 - c)]^{-1}$. Expanding $Y(D)$ about $d$ in a Taylor series,

(3.3) $\frac{n^{1/2}}{k}(Y - k) = \left[1 - e^{-n/k} - \frac{n}{k}e^{-n/k}\right]^{-1} \frac{n^{1/2}}{k}(D - d) + \frac{n^{1/2}}{k}\,O((D - d)^2)$.


Now, $(n^{1/2}/k)(D - d) = (n^{1/2}/k)(D - E(D) + E(D) - d)$. Note that
$n^{1/2}(E(D) - d)/k \to 0$. Using (3.1) and $n/k \to -\ln(1 - c)$, (3.3) is
asymptotically normal with mean 0 and variance
$(-(1 - c)\ln(1 - c))/(c + (1 - c)\ln(1 - c))$. The $D/Y$ factor in (3.2) contributes the extra factor of $c^2$,
and Theorem 1 is proven.
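The two ingredients of this argument can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's computation. It compares the implicit-differentiation slope of $Y$ at $D = d$ with a central finite difference of a bisection solution of (2.1), and then confirms that $(n/k^2)\,Y'(d)^2$ times the variance of $D$ reproduces the limiting variance of (3.3). The variance of $D$ is taken as $k e^{-n/k}(1 - (1 + n/k)e^{-n/k})$, the standard occupancy form assumed here for (3.1). The helper names, the step size $h$ and the values $k = 1000$, $c = 0.5$ are hypothetical choices.

```python
import math

def solve_Y(D, n, iterations=100):
    """Bisection solution of D = Y * (1 - exp(-n / Y)); needs 0 < D < n."""
    def f(y):
        return -y * math.expm1(-n / y) - D
    lo, hi = float(D), 2.0 * D
    while f(hi) < 0.0:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def delta_method_check(k, c, h=1e-2):
    """Return (closed-form slope, numeric slope, scaled variance, limiting variance)."""
    m = -math.log1p(-c)                      # limit of n / k
    n = m * k                                # take n / k equal to its limit
    q = math.exp(-n / k)                     # e^{-n/k}, equal to 1 - c here
    d = k * (1.0 - q)                        # centring constant, with Y(d) = k
    slope_closed = 1.0 / (1.0 - q - (n / k) * q)
    slope_numeric = (solve_Y(d + h, n) - solve_Y(d - h, n)) / (2.0 * h)
    var_D = k * q * (1.0 - (1.0 + n / k) * q)            # assumed form of the variance in (3.1)
    var_scaled = (n / k ** 2) * slope_closed ** 2 * var_D
    log_term = math.log1p(-c)
    var_limit = -(1.0 - c) * log_term / (c + (1.0 - c) * log_term)
    return slope_closed, slope_numeric, var_scaled, var_limit

if __name__ == "__main__":
    closed, numeric, scaled, limit = delta_method_check(k=1000, c=0.5)
    print("Y'(d) closed form vs finite difference:", closed, numeric)
    print("variance of (3.3) vs stated limit:     ", scaled, limit)
    print("Theorem 1 variance (times c^2):        ", 0.5 ** 2 * limit)
```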

Under the hypotheses, Theorem 2 follows easily from Theorem 4 of Esty
(1983), since $E(N_1)/n = (E(N_1)/k)(k/n) \sim (n/k)e^{-n/k}(k/n) \to 1 - c$, and
$E(2N_2)/n \sim (n/k)^2 e^{-n/k}(k/n) \to (-\ln(1 - c))(1 - c)$.

Theorem 3 follows immediately from Theorems 1 and 2. Corollary (a) is
obtained from the Taylor expansion of $\ln(1 - c)$. Corollaries (b) and (c) are
straightforward.


4. Conclusion. Good’s nonparametric coverage estimator, which is appropriate
for a wide variety of multinomial distributions, is remarkably efficient
relative to the best coverage estimator developed under the strong hypothesis 
that all classes are equally likely, even when that hypothesis is true. Thus the 
advantages of its wider validity serve to recommend the Good estimator for the 
coverage of a sample even if the classes are approximately equally likely. 


REFERENCES 


American Numismatic Society (1974). Estimating die and coinage output: Bibliography. Mimeographed
report.

Brown, I. D. (1955/57). Some notes on the coinage of Elizabeth I with special reference to her
hammered silver. Brit. J. Numis. 28 568-603.

Carcassone, C. (1980). Tables pour l'estimation par la méthode de maximum de vraisemblance du
nombre de coins de droit (ou de reverse) ayant servi à frapper une émission. In Symposium
Numismatico de Barcelona. Asociación Numismatica Española, Barcelona.

Carter, G. (1981). Comparison of methods for calculating the total number of dies from die-link
statistics. Statistics and Numismatics (C. Carcassone and T. Hackens, eds.) 204-213.
Council of Europe, Strasbourg.

Chao, A. (1981). On estimating the probability of discovering a new species. Ann. Statist. 9
1339-1342.

Efron, B. and Thisted, R. (1976). Estimating the number of unseen species: How many words did
Shakespeare know? Biometrika 63 435-447.

Engen, S. (1978). Stochastic Abundance Models. Halsted, New York.

Esty, W. W. (1982). Confidence intervals for the coverage of low coverage samples. Ann. Statist. 10
190-196.

Esty, W. W. (1983). A normal limit law for a nonparametric estimator of the coverage of a random
sample. Ann. Statist. 11 905-912.

Feller, W. A. (1968). An Introduction to Probability Theory and Its Applications 1, 3rd ed. Wiley,
New York.

Geiringer, H. (1938). On the probability theory of arbitrary linked events. Ann. Math. Statist. 9
260-271.

Good, I. J. (1953). The population frequencies of species and the estimation of population
parameters. Biometrika 40 237-264.

Good, I. J. and Toulmin, G. H. (1956). The number of new species and the increase in population
coverage, when a sample is increased. Biometrika 43 45-63.

Johnson, N. L. and Kotz, S. (1977). Urn Models and Their Application. Wiley, New York.

Lyon, C. S. S. (1965). The estimation of dies employed in a coinage. Numis. Circ. 73 180-181.

McNeil, D. R. (1973). Estimating an author’s vocabulary. J. Amer. Statist. Assoc. 68 92-96.

Mora-Mas, F. (1981). Estimation du nombre de coins selon les répetitions dans une trouvaille de
monnaies. In Statistics and Numismatics (C. Carcassone and T. Hackens, eds.) 173-192.
Council of Europe, Strasbourg.

Robbins, H. (1968). Estimating the total probability of the unobserved outcomes of an experiment.
Ann. Math. Statist. 39 256-257.

Schroeck, F. E. Jr. (1981). Tabulated results on the estimation of the number of dies of a coin.
Numis. Circ. 89 37-40.

Starr, N. (1979). Linear estimation of the probability of discovering a new species. Ann. Statist. 7
644-652.


DEPARTMENT OF MATHEMATICAL SCIENCES 
MONTANA STATE UNIVERSITY 
BOZEMAN, MONTANA 59717 




